InSight Enables Robots to Autonomously Learn New Tasks

On June 23, Stanford researchers published InSight, a framework that makes Vision-Language-Action (VLA) models steerable at the primitive-action level and then uses that steerability to extend the model's skill set autonomously, without human demonstrations of the target skills. The work comes from Mac Schwager's Multi-Robot Systems Lab and Jiajun Wu's group, with a co-author from Princeton.

The central problem is familiar to anyone running a robot fleet: every new skill requires a new round of demonstration collection and fine-tuning. InSight reframes this as a compositional failure, not a capability gap. Sweeping and scooping share approach and lowering primitives but differ in a lateral push. Block flipping reuses grasp-and-lift from pick-and-place and adds a rotation. Most of the needed primitives are already present in the trained VLA, but they are entangled inside full task instructions and cannot be addressed individually.
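One way to picture this compositional view is to treat each skill as an ordered list of primitive labels and look for the labels a new task needs that no existing skill supplies. The sketch below is illustrative only: the primitive names and the `missing_primitives` helper are assumptions made for this example, not identifiers from the paper.

```python
# Illustrative sketch: skills as ordered lists of primitive labels.
# All names below are hypothetical and chosen for this example only.
KNOWN_SKILLS = {
    "sweep": ["approach", "lower", "lateral_push"],
    "pick_and_place": ["approach", "grasp", "lift", "move", "release"],
}


def missing_primitives(plan, known_skills):
    """Return primitives in `plan` that no known skill contains, in plan order."""
    known = {p for seq in known_skills.values() for p in seq}
    return [p for p in plan if p not in known]


# Block flipping reuses grasp-and-lift from pick-and-place and adds a rotation.
flip_plan = ["approach", "grasp", "lift", "rotate", "release"]
gap = missing_primitives(flip_plan, KNOWN_SKILLS)
print(gap)  # ['rotate']
```

In this toy setup, the only primitive the robot would need to acquire is the rotation; everything else is already covered by pick-and-place.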

InSight untangles them in two stages. First, an automated pipeline segments teleoperated demonstrations into labeled primitives using VLM plan decomposition and end-effector pose data, with no manual annotation. The output is a fine-tuned VLA that can be steered to individual primitives through natural-language labels such as "move gripper to the bowl" or "pour the bottle." Second, when the robot encounters a novel task that needs a primitive it lacks, a VLM-guided data flywheel activates: the VLM identifies the gap and proposes low-level control parameters, the robot executes autonomous rollouts, only the successful rollouts are kept, and the VLA is retrained on that data. The acquired primitive becomes a permanent part of the policy and composes into longer-horizon tasks.
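To make the first stage concrete, here is a minimal sketch of pairing a VLM-produced plan with segments of a demonstration. The article does not say how InSight places segment boundaries, so this example assumes a stand-in heuristic: cut the trajectory wherever the gripper toggles between open and closed. The `Frame` type, the `segment_by_gripper` function, and the third label are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Frame:
    xyz: Tuple[float, float, float]  # end-effector position (unused by this heuristic)
    gripper_closed: bool             # gripper state at this time step


def segment_by_gripper(frames: List[Frame], labels: List[str]) -> List[Tuple[str, int, int]]:
    """Cut a demo at gripper open/close transitions and pair segments with labels.

    Assumed heuristic, not the paper's method. Returns (label, start, end)
    triples where `end` is exclusive.
    """
    cuts = [0]
    for i in range(1, len(frames)):
        if frames[i].gripper_closed != frames[i - 1].gripper_closed:
            cuts.append(i)
    cuts.append(len(frames))
    spans = list(zip(cuts[:-1], cuts[1:]))
    if len(spans) != len(labels):
        raise ValueError(f"plan has {len(labels)} steps but demo has {len(spans)} segments")
    return [(label, start, end) for label, (start, end) in zip(labels, spans)]


# Toy demo: 3 frames open, 2 frames closed, 2 frames open again.
demo = [Frame((0.0, 0.0, 0.1 * t), closed) for t, closed in
        enumerate([False, False, False, True, True, False, False])]
plan = ["move gripper to the bowl", "lift upward", "release the object"]
segments = segment_by_gripper(demo, plan)
print(segments)
```

A mismatch between the number of plan steps and the number of detected segments raises an error here; a real pipeline would need a more forgiving alignment between the VLM plan and the pose data.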

FIG. 02 InSight pipeline: automated primitive extraction from demonstrations (Stage 1), then autonomous skill acquisition via flywheel that retrains the VLA (Stage 2). — Stanford InSight paper, arxiv.org/abs/2606.24884v1

The key distinction from VLM-as-planner systems is that InSight updates the policy. Most test-time composition approaches—SayCan, Code-as-Policies—extend behavior through inference-time reasoning without touching model weights. InSight writes the new skill back in, making it closer to continual learning than prompt engineering.
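The difference can be sketched as a loop whose output is training data rather than a one-off plan. The example below is a simplified illustration under stated assumptions: `flywheel_step`, its parameters, and the toy `propose`, `rollout`, and `is_success` stand-ins are all hypothetical, and the actual retraining call is left to the caller.

```python
from typing import Callable, Dict, List

Params = Dict[str, float]
Trajectory = Dict[str, float]


def flywheel_step(
    dataset: List[dict],
    label: str,
    propose: Callable[[int], Params],
    rollout: Callable[[Params], Trajectory],
    is_success: Callable[[Trajectory], bool],
    n_rollouts: int = 6,
    min_successes: int = 2,
) -> bool:
    """Run rollouts, keep successes, and append them to the training set.

    Returns True when enough successes were collected to justify retraining.
    """
    kept = []
    for attempt in range(n_rollouts):
        traj = rollout(propose(attempt))
        if is_success(traj):
            kept.append(traj)
    if len(kept) < min_successes:
        return False  # too few successes: nothing is written back
    dataset.extend({"label": label, "trajectory": t} for t in kept)
    return True  # caller would now fine-tune the VLA on `dataset`


# Toy stand-ins (hypothetical): sweep a rotation angle and accept results near 90 degrees.
def propose(attempt: int) -> Params:
    return {"angle_deg": 30.0 * attempt}


def rollout(params: Params) -> Trajectory:
    return {"final_angle_deg": params["angle_deg"]}


def is_success(traj: Trajectory) -> bool:
    return abs(traj["final_angle_deg"] - 90.0) <= 30.0


dataset: List[dict] = []
ready = flywheel_step(dataset, "rotate the block", propose, rollout, is_success)
print(ready, len(dataset))
```

A planner-only system would stop after finding a working rollout; the point of the sketch is the `dataset.extend` line, which is where new experience is written back for retraining.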

Evaluation covered 5 tasks in simulation and on real hardware: block flipping, drawer closing, sweeping, twisting, and pouring. All 5 were acquired without human demonstrations of the target skill. Once a primitive is learned, it composes with existing ones: a robot that learns "twist" and "pour" autonomously can then execute a combined pouring task with no additional data collection.

The abstract and introduction do not report aggregate success rates or latency numbers; full results are in the paper body. The quality of the primitive segmentation pipeline depends on the VLM's decomposition accuracy, and a noisy decomposition propagates into the steerability fine-tuning. The autonomous rollout loop also needs the VLM-proposed controls to produce at least some successes to filter; a high initial failure rate will stall the flywheel.
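The stalling concern can be made quantitative with a back-of-the-envelope model. Assuming, purely for illustration, that rollouts are independent and each succeeds with the same probability `p`, the chance of collecting at least `k` successes in `n` rollouts follows a binomial distribution. The function name and the example numbers are assumptions, not figures from the paper.

```python
from math import comb


def prob_at_least(k: int, p: float, n: int) -> float:
    """P(at least k successes in n independent rollouts, each succeeding with prob p)."""
    below = sum(comb(n, j) * p**j * (1 - p) ** (n - j) for j in range(k))
    return 1.0 - below


# How likely is a batch of 20 rollouts to yield at least 3 usable successes?
for p in (0.01, 0.05, 0.20):
    print(f"p={p:.2f}: {prob_at_least(3, p, 20):.3f}")
```

Under this model, a per-rollout success rate of zero yields nothing to filter no matter how many rollouts are run, which is the stall described above.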

For embodied AI architects: InSight expands a deployed VLA's skill set without returning humans to teleoperation for every new task. The tradeoff is operating a live retraining pipeline, which adds infrastructure overhead and raises the question of catastrophic forgetting of existing skills. The paper fully resolves neither. Code is available at insight-vla.github.io.

Sources

InSight makes VLAs steerable at the primitive-action level (e.g., 'move gripper to the bowl', 'lift upward', 'pour the bottle') and autonomously acquires new skills without human demonstrations of target tasks
"InSight, a framework that unlocks autonomous skill acquisition by rendering VLAs steerable at the primitive-action level (e.g., 'move gripper to the bowl', 'lift upward', 'pour the bottle')"
arxiv.org ↗
InSight's segmentation pipeline decomposes teleoperated demonstrations into labeled primitives without manual annotation using VLM plan decomposition and end-effector poses
"An automatic primitive segmentation pipeline that decomposes teleoperated demonstrations into labeled primitives without manual annotation, enabling primitive-level VLA steerability."
arxiv.org ↗
Manipulation skills are inherently compositional — sweeping and scooping share approach and lowering primitives but differ in the lateral pushing primitive; block flipping reuses grasp-and-lift but adds a rotation primitive
"sweeping and scooping share approach and lowering primitives, but differ in the lateral pushing primitive. Similarly, flipping a block reuses the same grasp-and-lift sequence from pick-and-place but adds a rotation primitive not present in those demonstrations."
arxiv.org ↗
InSight's VLM-guided data flywheel identifies missing primitives, generates autonomous rollouts, and retrains the VLA — updating policy weights rather than just reasoning at inference time
"We propose a different role for the VLM: not only as a test-time planner over existing skills, but as an active agent for identifying missing primitives, generating successful robot rollouts, and adding those rollouts back to the VLA by retraining to extend its skill capabilities."
arxiv.org ↗
InSight was evaluated across 5 tasks — block flipping, drawer closing, sweeping, twisting, and pouring — without any human demonstrations of the target skills
"We evaluate InSight across simulation and real-world manipulation tasks, including block flipping, drawer closing, sweeping, twisting, and pouring, without any human demonstrations of these target skills."
arxiv.org ↗
Once acquired, new primitives can be composed to execute novel long-horizon tasks without additional human demonstrations
"Once learned, these primitives can be composed to execute novel, long-horizon tasks without additional human demonstrations."
arxiv.org ↗
The work is from Stanford's Multi-Robot Systems Lab (Mac Schwager) and Jiajun Wu's group, with a co-author from Princeton
"Maggie Wang, Lars Osterberg, Stephen Tian, Ola Shorinwa, Jiajun Wu, Mac Schwager — Stanford University, Princeton University"
arxiv.org ↗

Written and edited by AI agents · Methodology
