Researchers from the Robotics and AI Institute, Brown University, and Northeastern University have published Q2RL, an offline-to-online reinforcement learning algorithm that lets robots keep improving from live experience after initial training on human demonstrations, without overwriting the skills those demonstrations taught. The paper was accepted at Robotics: Science and Systems 2026.

The core problem Q2RL targets is familiar in robotics deployment: behavior cloning produces safe, immediately competent policies, but static ones. Existing offline-to-online methods suffer from distribution mismatch: the RL optimizer explores states unseen in the offline dataset and gradually corrupts the baseline behavior. Q2RL's answer is a two-phase architecture called Q-Estimation and Q-Gating. Q-Estimation derives a Q-function directly from the BC policy's action log-probabilities and entropy; it needs no labeled training data, only a set of initial environment rollouts from which a value function is estimated via Monte Carlo returns. This bootstrapped Q-function maps where the BC policy is confident and where it is not.
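
The paper's exact construction is not reproduced here, but a minimal sketch of the idea, assuming a torch-style BC policy that returns an action distribution, might look like the following: Monte Carlo returns from a handful of rollouts fit a state-value estimate, and the BC policy's log-probability and entropy turn that into an action-conditioned score. The function names, rollout format, and the specific way the terms are combined are illustrative assumptions, not the authors' formulation.

```python
def monte_carlo_returns(trajectories, gamma=0.99):
    """Turn initial rollouts into (state, discounted return) pairs for fitting
    a value estimate. `trajectories` is a hypothetical list of (states, rewards)."""
    states, returns = [], []
    for traj_states, traj_rewards in trajectories:
        g = 0.0
        traj_returns = []
        for r in reversed(traj_rewards):   # accumulate discounted returns back-to-front
            g = r + gamma * g
            traj_returns.append(g)
        traj_returns.reverse()
        states.extend(traj_states)
        returns.extend(traj_returns)
    return states, returns

def q_from_bc(state, action, bc_policy, value_fn):
    """Illustrative action score bootstrapped from the BC policy: the fitted value
    of the state, adjusted by how confidently BC would take this action there.
    Assumes bc_policy(state) returns a distribution whose log_prob and entropy
    are scalars per state; the exact combination in the paper may differ."""
    dist = bc_policy(state)
    log_prob = dist.log_prob(action)   # BC's confidence in this particular action
    entropy = dist.entropy()           # BC's overall uncertainty in this state
    return value_fn(state) + log_prob - entropy
```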

Q-Gating then uses that frozen BC Q-function as a guard rail during online RL. At every timestep, the system computes Q-values for both the BC action and the RL policy's proposed action and executes whichever scores higher. A frozen Q-BC preserves proven behaviors; a trainable Q-RL drives improvement in states where BC is weak. An auxiliary BC loss also stabilizes the RL policy during training. The result is a division of labor the team demonstrated concretely: BC handles smooth, well-practiced motions like reaching and initial alignment; RL takes over for contact-rich insertion phases and recovery from grasp failures.
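
One plausible reading of the gating step is sketched below, with the frozen BC Q-function as the arbiter at each control step; the function names and the choice of which critic scores the candidates are assumptions drawn from the description above, not the authors' released code.

```python
def q_gated_step(state, bc_policy, rl_policy, q_bc):
    """One control step under Q-Gating (illustrative sketch).
    q_bc: frozen scoring function from the Q-Estimation phase, taking (state, action).
    Score the BC action and the RL proposal, then execute whichever wins; because
    RL only acts when it beats BC's own score, exploration stays bounded by
    demonstrated competence from the first episode."""
    a_bc = bc_policy(state)   # action the frozen behavior-cloned policy would take
    a_rl = rl_policy(state)   # action proposed by the online RL policy
    if q_bc(state, a_rl) > q_bc(state, a_bc):
        return a_rl           # RL takes over where BC is weak (e.g., contact-rich phases)
    return a_bc               # otherwise preserve the proven BC behavior
```

In this reading, the trainable Q-RL critic and the auxiliary BC loss operate inside the RL update itself, while the frozen Q-BC only guards which action reaches the robot.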

FIG. 02 Q2RL pipeline: frozen BC Q-function guards online RL exploration at each timestep.

On a Franka Panda arm equipped with workspace and wrist RGB cameras and a Robotiq 2F-85 gripper, Q2RL was benchmarked against IBRL, the current state-of-the-art BC-to-RL method, across three tasks. For peg insertion, Q2RL reached 100% success versus 70% for the BC baseline and 95% for IBRL. Pipe assembly, a longer-horizon, contact-rich task, showed sharper gains: IBRL scored 0% while Q2RL hit 75% against a 20% BC baseline, learning the full grasp-align-insert sequence within 2.5 hours. On a distribution-shift variant of a kitting task, where the BC policy was trained on single-object bins but evaluated on two-object bins, BC dropped from 95% to 35%, IBRL again scored 0%, and Q2RL recovered to 70%. Across tasks, Q2RL achieved up to a 3.75x improvement over the original BC policy. Simulation benchmarks on D4RL (Kitchen, Pen, Door) and robomimic (Lift, Can, Square) showed consistent outperformance over offline-to-online baselines on both success rate and convergence speed.

FIG. 03 Q2RL success rates across hardware manipulation tasks, outpacing BC and IBRL baselines.

During early hardware trials, Q2RL recorded zero safety violations; IBRL triggered two. The BC Q-function acts as an implicit safety floor: the RL policy can only execute actions the gate scores above what BC would do, which constrains exploration from the first episode. This matters for teams operating shared or high-value workspaces where unconstrained exploration is operationally unacceptable.

The practical deployment calculus shifts if a policy can be shipped at BC-level competency and then continue learning autonomously over 1–2 hours of on-robot interaction per task. The alternative of retraining in simulation, validating, and redeploying introduces latency measured in days and rests on simulation-to-reality transfer assumptions that break under environmental drift. Q2RL sidesteps both by treating the physical robot as the training environment while keeping the BC policy as a live safety net.

Open questions remain. The current results use a single robot platform and a narrow class of manipulation tasks; generalization to mobile manipulation, multi-arm settings, or vision-only control pipelines without proprioceptive state is unproven. The Monte Carlo value estimation step assumes the BC policy is already competent enough to produce non-trivial returns; tasks where the BC success rate is near zero would yield a degenerate Q-BC and potentially remove the implicit safety floor. The team has released code and video at the project site. The next validation step for enterprise teams is stress-testing Q-Gating at BC success rates below 20% and across longer task horizons than the published benchmarks cover.
