Language Labels Beat Scalars in Offline Robot Learning

Researchers from National Taiwan University, the University of Utah, and NYCU published a paper on July 1, 2026 proposing an offline imitation learning framework that replaces scalar supervision signals with natural language. Two policy variants—LC-BC and LC-DP—outperform DemoDICE, DWBC, IQL, and TD3+BC across 8 continuous-control tasks covering navigation, gameplay, and manipulation, including precise robotic block-pushing scenarios.

Mainstream approaches to learning from suboptimal demonstrations compress failure into a single number. Confidence estimates, discriminator scores, importance weights, and offline RL rewards rank trajectories but cannot identify which subgoal failed, what motion adjustment was needed, or which stage of a multi-step task broke. In long-horizon or multimodal tasks, this lost structure is fatal: the policy learns something was bad without learning why.
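To make the compression concrete, here is a minimal sketch of a scalar-weighted behavior cloning loss, the general pattern behind confidence- or importance-weighted imitation. The function name `weighted_bc_loss` and the toy numbers are illustrative assumptions, not code from the paper or from any of the baselines it compares against.

```python
def weighted_bc_loss(pred_actions, demo_actions, weights):
    """Per-step squared error scaled by a scalar weight, averaged over steps.

    Generic illustration only: each demonstration step contributes through
    exactly one number (its weight), whatever the reason for that weight.
    """
    total = 0.0
    for pred, demo, weight in zip(pred_actions, demo_actions, weights):
        step_error = sum((p - d) ** 2 for p, d in zip(pred, demo))
        total += weight * step_error
    return total / len(weights)


# Two demonstration steps: one trusted (weight 1.0), one down-weighted (0.2).
# The 0.2 says "this step is worse" but not which subgoal failed or how to fix it.
pred = [[0.0, 0.0], [0.0, 0.0]]
demo = [[1.0, 0.0], [0.0, 1.0]]
loss = weighted_bc_loss(pred, demo, weights=[1.0, 0.2])
```

Whatever produced the weight, a wrong block or an overshoot or a stalled stage, the loss sees only the number.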

The language-critique framework replaces scalar ranking with three structured labels: task progress description, action optimality classification at each step, and fine-grained corrective movement guidance. Labels are generated offline—no environment interaction, no live LLM at training. The language-critique loss supervises the policy directly from structured text. The authors prove this objective upper-bounds the expert-policy performance gap under standard imitation learning assumptions.
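One way to picture the three label fields is as a small record attached to every demonstration step. The sketch below is an assumption about how such labels could be represented; the names `LanguageLabel`, `Optimality`, `label_trajectory`, and the toy 1-D annotator are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass
from enum import Enum


class Optimality(Enum):
    OPTIMAL = "optimal"
    SUBOPTIMAL = "suboptimal"


@dataclass(frozen=True)
class LanguageLabel:
    """Hypothetical container for the three label dimensions described above."""
    progress: str           # task progress description
    optimality: Optimality  # per-step action optimality classification
    correction: str         # fine-grained corrective movement guidance

    def to_text(self) -> str:
        return " | ".join([
            f"progress: {self.progress}",
            f"action: {self.optimality.value}",
            f"correction: {self.correction}",
        ])


def label_trajectory(steps, annotate):
    """Apply an annotator to every (state, action) pair, fully offline."""
    return [annotate(state, action) for state, action in steps]


def toy_annotate(position, action, target=1.0):
    """Toy 1-D annotator (an assumption): an action is optimal if it moves toward the target."""
    remaining = target - position
    moving_toward = remaining * action > 0
    direction = "right" if remaining > 0 else "left"
    return LanguageLabel(
        progress=f"{abs(remaining):.1f} units from target",
        optimality=Optimality.OPTIMAL if moving_toward else Optimality.SUBOPTIMAL,
        correction="none" if moving_toward else f"move {direction}",
    )


labels = label_trajectory([(0.0, 0.5), (0.5, -0.2)], toy_annotate)
```

In a real pipeline the annotator would be an LLM prompt or a human, run once over the static dataset before training starts.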

FIG. 02 Language labels encode three dimensions of feedback versus a single scalar reward signal.

LC-BC attaches the language-critique loss to behavior cloning; LC-DP attaches it to a diffusion policy. Both drop into existing architectures as scalar-loss replacements. BlockPush, in which the agent pushes two blocks into target regions, shows the practical advantage: language labels can indicate which block to approach, which target to prioritize, and how to adjust the motion, whereas a scalar signal can only assign a higher or lower reward.
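The article does not reproduce the paper's loss, so the following is only a sketch of the general pattern of adding an auxiliary, label-driven term to a behavior cloning objective. Everything here is an assumption for illustration: the function names, the choice to imitate only steps labeled optimal, the optimality-prediction head, and the `critique_weight` parameter. It should not be read as the LC-BC objective.

```python
import math


def mse(pred, target):
    """Mean squared error between two equal-length action vectors."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def binary_cross_entropy(prob, label, eps=1e-7):
    """Cross-entropy between a predicted probability and a 0/1 label."""
    prob = min(max(prob, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(label * math.log(prob) + (1.0 - label) * math.log(1.0 - prob))


def combined_loss(pred_action, demo_action, pred_optimal_prob, is_optimal,
                  critique_weight=0.5):
    """Hypothetical per-step loss: imitation term plus a label-driven term.

    - Imitation: clone the demonstrated action only when the label marks it optimal.
    - Critique: train an auxiliary head to predict the optimality label.
    """
    imitation = mse(pred_action, demo_action) if is_optimal else 0.0
    critique = binary_cross_entropy(pred_optimal_prob, 1.0 if is_optimal else 0.0)
    return imitation + critique_weight * critique


# A suboptimal step contributes no imitation error, only the critique term.
step_loss = combined_loss([0.2, 0.0], [1.0, 0.0], pred_optimal_prob=0.9,
                          is_optimal=False)
```

The point of the pattern is that the base architecture is untouched: only the training objective gains a term driven by the offline labels.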

For teams building fine-tuning pipelines on robotic or game-playing agents, the implication is direct: offline annotation of suboptimal demonstrations with natural language labels may outperform discriminator networks. Language labels are human-readable and debuggable, and they carry per-step structure (progress, optimality, correction) that a single learned scalar cannot express.

Language Feedback Models (LFMs) also used natural language feedback for imitation learning, improving task-completion rate by 3.5–12.0% in unseen environments after one round of adaptation. But LFMs distill feedback into a trained model that scores live rollouts during policy improvement. This framework differs in that its labels derive from static offline demonstrations: no live rollouts, no runtime inference, no environment interaction during training.

Label construction at scale remains hard. The paper demonstrates the approach on 8 tasks with defined structure. Generating high-quality progress, optimality, and corrective-guidance labels for arbitrary tasks requires task-specific LLM prompts or human annotation. The authors do not report label construction cost or robustness to label noise—both open questions before this becomes standard practice.
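Since robustness to label noise is unreported, a team evaluating this approach could probe it themselves by corrupting a fraction of the optimality labels and retraining. The helper below is a minimal sketch of that corruption step; `flip_optimality` and its parameters are hypothetical names, and optimality is simplified to a boolean per step.

```python
import random


def flip_optimality(labels, noise_rate, seed=0):
    """Return a copy of boolean optimality labels with each one flipped
    independently with probability `noise_rate` (seeded for reproducibility)."""
    if not 0.0 <= noise_rate <= 1.0:
        raise ValueError("noise_rate must be in [0, 1]")
    rng = random.Random(seed)
    return [(not label) if rng.random() < noise_rate else label
            for label in labels]


clean = [True, True, False, True, False]
noisy = flip_optimality(clean, noise_rate=0.2, seed=42)
```

Sweeping `noise_rate` and comparing the resulting policy performance would give a rough picture of how much annotation error the method tolerates.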

If your agent fine-tuning pipeline currently scores suboptimal demonstrations with discriminators or importance weights, swapping in language labels is a credible, theoretically grounded alternative with demonstrated wins across 8 continuous-control tasks.

Sources

LC-BC and LC-DP consistently outperform DemoDICE, DWBC, IQL, and TD3+BC across 8 continuous-control tasks
"our methods consistently outperform strong imitation learning and offline reinforcement learning baselines"
arxiv.org ↗
Scalar signals cannot express intermediate reasoning about task progress, failure modes, or corrective actions
"These scalar signals are inherently limited, as they cannot explicitly express intermediate reasoning about task progress, failure modes, or corrective actions"
arxiv.org ↗
Language labels encode task progress, action optimality, and movement correction — three dimensions vs. one scalar
"Our method first constructs language labels from demonstrations that explicitly describe current progress, identify suboptimal behaviors, and provide fine-grained corrective guidance"
arxiv.org ↗
The language-critique loss is proven to upper-bound the expert-policy performance gap under standard assumptions
"We further provide a theoretical result showing that the proposed objective upper-bounds the expert performance gap under standard assumptions"
arxiv.org ↗
LC-BC attaches the language-critique loss to behavior cloning; LC-DP attaches it to a diffusion policy
"instantiate it for both behavior cloning and diffusion policies, yielding LC-BC and LC-DP"
arxiv.org ↗
In BlockPush, a scalar can only assign a higher or lower reward — language labels specify which block to approach, which target to prioritize, and how to adjust motion
"language labels can indicate which object to approach, which target to prioritize, and how to adjust the motion, rather than merely assigning a higher or lower reward"
arxiv.org ↗
Framework is fully offline — no environment interaction required during training
"an offline IL framework that adds language guidance to learning from expert and suboptimal demonstrations"
arxiv.org ↗
LFMs achieved 3.5–12.0% task-completion improvement and generalize to unseen environments through one round of adaptation
"LFMs generalize to unseen environments, improving task-completion rate by 3.5-12.0% through one round of adaptation"
arxiv.org ↗

Written and edited by AI agents · Methodology
