A single independent researcher, Ilia Larchenko, finished 1st of 62 teams in the LeHome Challenge 2026 simulation phase — ICRA's first standardized competition for deformable-object manipulation — then placed 2nd in the real-world final in Vienna with a score of 865 against the winner's 895. The arXiv paper frames itself as a recipe paper, not a research claim: known RL techniques recombined under competition pressure, applied to a flow-matching VLA.

FIG. 02 LeHome Challenge 2026 real-world final leaderboard: top-3 scores. — LeHome Challenge 2026

The task required bimanual garment folding on an SO-ARM101 setup: two 6-DOF arms, a 12-dimensional joint action space, and three RGB cameras, running at 30 Hz in simulation and 20 Hz on the physical robot. There were four garment types (long-sleeved tops, short-sleeved tops, long pants, and shorts), and success was binary, defined by keypoint distance conditions: 5 for tops, 4 for pants. The policy received no garment-category label at inference and had to infer the type from vision alone.
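As a rough illustration of a binary success check built from keypoint distance conditions, here is a minimal sketch. The keypoint names, the threshold values, and the assumption that every condition must hold at once are all hypothetical; the article only states that tops have 5 conditions and pants have 4.

```python
import math

def folded_successfully(keypoints, conditions):
    """Return True only if every distance condition is met.

    keypoints:  dict mapping a keypoint name to an (x, y, z) position in meters.
    conditions: list of (name_a, name_b, max_distance) tuples (hypothetical format).
    """
    return all(
        math.dist(keypoints[a], keypoints[b]) <= max_distance
        for a, b, max_distance in conditions
    )

# Hypothetical example for a pair of pants with two of its conditions shown.
pants_keypoints = {
    "waist_left": (0.00, 0.00, 0.0),
    "hem_left": (0.02, 0.01, 0.0),
    "waist_right": (0.30, 0.00, 0.0),
    "hem_right": (0.31, 0.02, 0.0),
}
pants_conditions = [
    ("waist_left", "hem_left", 0.05),
    ("waist_right", "hem_right", 0.05),
]
success = folded_successfully(pants_keypoints, pants_conditions)
```

Because the result is an all-or-nothing conjunction, a single misplaced corner fails the whole attempt, which is one reason a binary metric like this gives sparse feedback to a learner.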

The RL loop combines AWR (Advantage-Weighted Regression) with RECAP-style advantage conditioning on a flow-matching VLA. AWR upweights high-advantage frames during training. RECAP feeds the advantage to the network as a conditioning input, which enables classifier-free guidance at inference: the "aggressiveness" of the policy can be dialed without retraining. Larchenko argues this approach suits flow-matching VLAs better than on-policy PPO, which risks instability with the non-Markovian trajectory structure common in manipulation.
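The two ingredients can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's implementation: the exponential weighting with temperature `beta` and the clip `max_weight` follow the standard AWR form, and `guided_velocity` uses the usual classifier-free guidance interpolation applied to a flow-matching velocity. The function names and default values are hypothetical.

```python
import math

def awr_weights(advantages, beta=1.0, max_weight=20.0):
    """Per-frame loss weights: exp(advantage / beta), clipped for stability."""
    return [min(math.exp(a / beta), max_weight) for a in advantages]

def guided_velocity(v_uncond, v_cond, guidance_scale):
    """Classifier-free guidance on a predicted velocity vector.

    guidance_scale = 0 gives the unconditional prediction, 1 gives the
    advantage-conditioned prediction, and values above 1 extrapolate
    further toward high-advantage behavior.
    """
    return [u + guidance_scale * (c - u) for u, c in zip(v_uncond, v_cond)]

# Frames with higher advantage receive larger training weights.
weights = awr_weights([-1.0, 0.0, 2.0], beta=1.0)

# The same trained network can be pushed harder at inference.
velocity = guided_velocity([0.0, 1.0], [1.0, 1.0], guidance_scale=2.0)
```

The guidance scale is the inference-time knob the text calls "aggressiveness": it changes how strongly the sampled actions lean toward the high-advantage condition without touching the weights.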

The policy doubles as its own value function, eliminating a separate critic. The same network outputs actions, success probability, task progress, and task-relevant future quantities. These auxiliary outputs drive advantage estimation, live failure detection, and candidate selection at inference. Training ran on a single H200 with rollouts collected in parallel on RTX PRO 6000 GPUs. The training worker, rollout workers, and DAgger station communicate solely through HuggingFace Hub checkpoints.
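One way the auxiliary outputs could serve those three roles is sketched below. The `PolicyOutput` fields mirror the outputs named above, but the n-step difference used as an advantage proxy, the failure threshold, and the pick-the-highest-success-probability rule are assumptions made for illustration; the article does not specify the exact formulas.

```python
from dataclasses import dataclass

@dataclass
class PolicyOutput:
    actions: list        # action chunk proposed by the policy
    success_prob: float  # predicted probability that the episode succeeds
    progress: float      # predicted task progress in [0, 1]

def n_step_advantage(success_probs, t, n):
    """Advantage proxy: change in the policy's own success estimate over n steps."""
    end = min(t + n, len(success_probs) - 1)
    return success_probs[end] - success_probs[t]

def looks_like_failure(output, threshold=0.2):
    """Live failure detection: flag states where predicted success collapses."""
    return output.success_prob < threshold

def select_candidate(candidates):
    """Candidate selection: keep the sampled chunk the policy rates most likely to succeed."""
    return max(candidates, key=lambda c: c.success_prob)

candidates = [
    PolicyOutput(actions=[0.1] * 12, success_prob=0.55, progress=0.3),
    PolicyOutput(actions=[0.2] * 12, success_prob=0.80, progress=0.3),
]
best = select_candidate(candidates)
advantage = n_step_advantage([0.5, 0.6, 0.9], t=0, n=2)
```

Since all three functions read heads of the same network, no separate critic has to be trained or kept in sync with the policy.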

Inference-time hyperparameter optimization uses Thompson sampling. Rather than fixing guidance strength and candidate count at training time, the system searches over them during evaluation, treating each hyperparameter configuration as a bandit arm and each attempt as a pull. For competition settings with a fixed number of attempts, this homes in on good settings for sensitive hyperparameters without burning shots on random exploration.
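A minimal Beta-Bernoulli Thompson sampler over discrete configurations looks like the sketch below. The arms are hypothetical `(guidance_scale, num_candidates)` pairs, and the success rates in `simulate` are invented purely to exercise the sampler; neither comes from the paper.

```python
import random

class ThompsonSampler:
    """Beta-Bernoulli Thompson sampling over a fixed set of arms."""

    def __init__(self, arms, seed=None):
        self.arms = list(arms)
        self.successes = {arm: 0 for arm in self.arms}
        self.failures = {arm: 0 for arm in self.arms}
        self.rng = random.Random(seed)

    def choose(self):
        # Draw one sample from each arm's Beta posterior (uniform prior),
        # then play the arm with the highest draw.
        draws = {
            arm: self.rng.betavariate(self.successes[arm] + 1, self.failures[arm] + 1)
            for arm in self.arms
        }
        return max(draws, key=draws.get)

    def update(self, arm, success):
        if success:
            self.successes[arm] += 1
        else:
            self.failures[arm] += 1

def simulate(true_rates, attempts, seed=0):
    """Run the sampler against made-up per-arm success rates; return pull counts."""
    env_rng = random.Random(seed)
    sampler = ThompsonSampler(true_rates.keys(), seed=seed)
    pulls = {arm: 0 for arm in true_rates}
    for _ in range(attempts):
        arm = sampler.choose()
        sampler.update(arm, env_rng.random() < true_rates[arm])
        pulls[arm] += 1
    return pulls

# Hypothetical arms: (guidance_scale, num_candidates) -> assumed success rate.
rates = {(1.0, 1): 0.1, (2.0, 4): 0.9, (3.0, 8): 0.3}
pulls = simulate(rates, attempts=400)
```

Weak configurations still get sampled occasionally while their posteriors are wide, but attempts concentrate quickly on the arm that keeps succeeding, which is the property that matters when the attempt budget is fixed.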

Sim-to-real transfer was blind: Larchenko had no access to the organizers' physical rig. The transfer chain ran simulation → Larchenko's own robot → the organizers' robot. Camera-alignment tooling anchored the simulation viewpoint to the real overhead camera. Heavy domain randomization covered lighting and garment texture. A DAgger-style human-in-the-loop correction process patched distribution shift. For garment-type inference, a learned input token is bootstrapped at inference time by a lightweight classifier that runs a short rollout before the policy commits to the main trajectory.
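Domain randomization over lighting and garment texture can be pictured as sampling a fresh set of scene parameters for every simulated episode. The sketch below is an assumption-heavy illustration: the `SceneParams` fields, the numeric ranges, and the texture count are hypothetical and are not taken from the paper or from any simulator API.

```python
import random
from dataclasses import dataclass

@dataclass
class SceneParams:
    light_intensity: float     # relative brightness multiplier (hypothetical range)
    light_color_temp_k: float  # color temperature in kelvin (hypothetical range)
    garment_rgb: tuple         # base garment color, each channel in [0, 1)
    texture_id: int            # index into a hypothetical texture library

def sample_scene(rng, num_textures=50):
    """Draw one randomized scene configuration for a simulated episode."""
    return SceneParams(
        light_intensity=rng.uniform(0.4, 1.6),
        light_color_temp_k=rng.uniform(3000.0, 7000.0),
        garment_rgb=tuple(rng.random() for _ in range(3)),
        texture_id=rng.randrange(num_textures),
    )

rng = random.Random(0)
scenes = [sample_scene(rng) for _ in range(100)]
```

Sampling per episode rather than per training run means the policy rarely sees the same appearance twice, which is the intended pressure toward features that survive the move to an unseen physical rig.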

FIG. 03 Blind sim-to-real transfer chain: policy trained on researcher's simulation and robot, then evaluated on organizers' physical rig with no access. — Larchenko, 2026

Behavior cloning on the organizers' scripted demonstrations failed because expert trajectories were inflexible — small cloth deviations produced no recovery signal. The RL loop addresses that brittleness. Generalization to unseen garments required heavy simulation domain randomization with no coverage guarantee. The 30-point gap to the winner in the real-world round suggests the randomization was imperfect.

For architects building bimanual systems, the actionable takeaways are specific: AWR + RECAP advantage conditioning composes onto a flow-matching policy without a separate critic; HuggingFace Hub as shared rollout state eliminates distributed RL infrastructure; Thompson sampling at inference is a low-overhead way to tune sensitive hyperparameters within a fixed attempt budget. The paper explicitly notes that no component was ablated in isolation. This is a deployment log, not a proof.

Written and edited by AI agents