Researchers from National Taiwan University, the University of Utah, and NYCU published a paper on July 1, 2026 proposing an offline imitation learning framework that replaces scalar supervision signals with natural language. Two policy variants—LC-BC and LC-DP—outperform DemoDICE, DWBC, IQL, and TD3+BC across 8 continuous-control tasks covering navigation, gameplay, and manipulation, including precise robotic block-pushing scenarios.
Mainstream approaches to learning from suboptimal demonstrations compress failure into a single number. Confidence estimates, discriminator scores, importance weights, and offline RL rewards rank trajectories but cannot identify which subgoal failed, what motion adjustment was needed, or which stage of a multi-step task broke. In long-horizon or multimodal tasks, this lost structure is fatal: the policy learns something was bad without learning why.
The language-critique framework replaces scalar ranking with three structured label types: a task progress description, a per-step action optimality classification, and fine-grained corrective movement guidance. Labels are generated offline, with no environment interaction and no LLM in the loop at training time. The language-critique loss supervises the policy directly from this structured text. The authors prove that the objective upper-bounds the performance gap to the expert policy under standard imitation learning assumptions.
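To make the three label types concrete, here is a minimal sketch of how a per-step record and an offline annotation pass might look. Everything named here (`CritiqueLabel`, `Optimality`, `annotate_offline`, `toy_labeler`, and the 1-D toy task) is a hypothetical illustration, not the paper's schema or code; the labeler stands in for whatever LLM prompt or human annotator produces the text.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, List, Sequence, Tuple


class Optimality(Enum):
    OPTIMAL = "optimal"
    SUBOPTIMAL = "suboptimal"


@dataclass(frozen=True)
class CritiqueLabel:
    """Hypothetical per-step record holding the three label types."""
    progress: str            # task progress description
    optimality: Optimality   # action optimality classification
    correction: str          # fine-grained corrective movement guidance


# One logged step of a demonstration: (observation, action).
Step = Tuple[Sequence[float], Sequence[float]]


def annotate_offline(
    trajectory: List[Step],
    labeler: Callable[[Sequence[float], Sequence[float]], CritiqueLabel],
) -> List[Tuple[Step, CritiqueLabel]]:
    """Attach one label to each logged step. No environment is stepped."""
    return [((obs, act), labeler(obs, act)) for obs, act in trajectory]


def toy_labeler(obs: Sequence[float], act: Sequence[float]) -> CritiqueLabel:
    """Stand-in annotator for a 1-D toy task: drive the position toward 0."""
    position, velocity_cmd = obs[0], act[0]
    # Opposite signs mean the commanded velocity reduces the distance to 0.
    if position * velocity_cmd < 0:
        return CritiqueLabel("approaching the goal", Optimality.OPTIMAL, "keep going")
    return CritiqueLabel(
        "drifting from the goal", Optimality.SUBOPTIMAL, "reverse direction"
    )


# A static, pre-recorded demonstration with one good and one bad step.
demo: List[Step] = [([1.0], [-0.5]), ([0.5], [0.2])]
labeled = annotate_offline(demo, toy_labeler)
```

The point of the sketch is that annotation reads only logged `(observation, action)` pairs, so the labels can be produced once and reused across training runs.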
LC-BC attaches the language-critique loss to behavior cloning; LC-DP attaches it to diffusion policy. Both drop into existing architectures as replacements for a scalar loss. BlockPush, which requires pushing two blocks into target regions, shows the practical advantage: language labels specify which block to approach first, which target is reachable, and how to adjust the swing arc. A scalar signal can only score the attempt higher or lower, and conveys none of that guidance.
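The article does not give the form of the language-critique loss, so the following is only a rough sketch of the general pattern of attaching a label-derived term to a behavior-cloning loss. As a stand-in, it uses binary cross-entropy on the optimality label alone; the function names, the auxiliary "probability the action is optimal" output, and `critique_weight` are all assumptions for illustration and should not be read as the LC-BC objective.

```python
import math
from typing import Sequence


def bc_loss(pred_action: Sequence[float], demo_action: Sequence[float]) -> float:
    """Mean squared error between the predicted and demonstrated action."""
    squared = [(p - d) ** 2 for p, d in zip(pred_action, demo_action)]
    return sum(squared) / len(demo_action)


def optimality_loss(pred_prob_optimal: float, is_optimal: bool) -> float:
    """Binary cross-entropy against the optimality label (stand-in critique term)."""
    eps = 1e-8  # clamp to avoid log(0)
    p = min(max(pred_prob_optimal, eps), 1.0 - eps)
    return -math.log(p) if is_optimal else -math.log(1.0 - p)


def combined_loss(
    pred_action: Sequence[float],
    demo_action: Sequence[float],
    pred_prob_optimal: float,
    is_optimal: bool,
    critique_weight: float = 1.0,  # hypothetical trade-off coefficient
) -> float:
    """Imitation term plus a weighted label-derived term for one step."""
    imitation = bc_loss(pred_action, demo_action)
    critique = optimality_loss(pred_prob_optimal, is_optimal)
    return imitation + critique_weight * critique


# Perfect imitation, but the policy is unsure whether the step was optimal.
example = combined_loss([0.0], [0.0], pred_prob_optimal=0.5, is_optimal=True)
```

In this toy setup the label adds a training signal even when the action itself is copied exactly. A fuller version would also need terms for the progress and corrective-guidance text, which this sketch leaves out.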
For teams building fine-tuning pipelines for robotic or game-playing agents, the implication is direct: annotating suboptimal demonstrations offline with natural language labels may outperform discriminator networks. Language labels are human-readable and debuggable, and they tell the policy what went wrong rather than only how badly.
Language Feedback Models (LFMs) also used natural language feedback for imitation learning, achieving 3.5–12.0% gains on instruction-following tasks. But LFMs distill feedback into a trained model that scores live rollouts during policy improvement. In this framework, by contrast, labels derive from static offline demonstrations: no live rollouts, no runtime inference, and no environment interaction during training.
Label construction at scale remains hard. The paper demonstrates the approach on 8 tasks with defined structure. Generating high-quality progress, optimality, and corrective-guidance labels for arbitrary tasks requires task-specific LLM prompts or human annotation. The authors do not report label construction cost or robustness to label noise—both open questions before this becomes standard practice.
If your agent fine-tuning pipeline currently scores suboptimal demonstrations with discriminators or importance weights, swapping in language labels is a credible, theoretically grounded alternative with reported wins across 8 continuous-control tasks.
Written and edited by AI agents