On June 23, Stanford researchers published InSight, a framework that makes Vision-Language-Action (VLA) models steerable at the primitive-action level, then uses that steerability to autonomously extend the model's skill set without human demonstrations. The work comes from Mac Schwager's Multi-Robot Systems Lab and Jiajun Wu's group, with a co-author from Princeton.

The central problem is familiar to anyone running a robot fleet: every new skill requires a new round of demonstration collection and fine-tuning. InSight reframes this as a compositional failure, not a capability gap. Sweeping and scooping share approach and lowering primitives but differ in a lateral push. Block flipping reuses grasp-and-lift from pick-and-place and adds a rotation. The skills are already present in the trained VLA—they're entangled inside full task instructions and not individually addressable.
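To make the compositional framing concrete, here is a minimal sketch that treats each skill as an ordered list of primitives and asks which primitives a new skill would still need. The decompositions and label names are illustrative assumptions based on the examples above, not the paper's actual primitive vocabulary.

```python
# Hypothetical primitive decompositions; labels are illustrative, not from the paper.
SKILLS = {
    "scoop": ["approach", "lower", "lift"],
    "sweep": ["approach", "lower", "lateral_push"],
    "pick_and_place": ["approach", "grasp", "lift", "move", "release"],
    "flip_block": ["approach", "grasp", "lift", "rotate", "release"],
}


def missing_primitives(known_skills, new_skill):
    """Return the primitives of new_skill that no known skill already contains."""
    known = set()
    for name in known_skills:
        known.update(SKILLS[name])
    return [p for p in SKILLS[new_skill] if p not in known]


print(missing_primitives(["scoop"], "sweep"))                # ['lateral_push']
print(missing_primitives(["pick_and_place"], "flip_block"))  # ['rotate']
```

In this toy view, each new skill is one primitive away from something the policy already does, which is the gap InSight aims to close without new demonstrations.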

InSight untangles them in two stages. First, an automated pipeline segments teleoperated demonstrations into labeled primitives using VLM plan decomposition and end-effector pose data, with no manual annotation. The output is a fine-tuned VLA that can be steered to individual primitives via natural-language labels like "move gripper to the bowl" or "pour the bottle." Second, when the robot encounters a novel task that needs a primitive it lacks, a VLM-guided data flywheel activates: the VLM identifies the gap and proposes low-level control parameters, then the system runs autonomous rollouts, keeps the successful ones, and retrains the VLA on that data. The acquired primitive becomes a permanent part of the policy and composes into longer-horizon tasks.
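As a rough illustration of the first stage, the sketch below pairs an ordered list of primitive labels (standing in for the VLM's plan decomposition) with chunks of an end-effector trajectory. The cut rule used here, splitting wherever the gripper pauses, is an assumption chosen for simplicity; the article does not describe the paper's actual segmentation criterion. The function name, threshold, and the "lower gripper" label are likewise hypothetical.

```python
import math


def segment_by_pauses(positions, labels, pause_speed=0.01):
    """Split a trajectory at pauses and pair each moving chunk with a label.

    positions: list of (x, y, z) end-effector positions, one per timestep.
    labels: ordered primitive labels, e.g. from a VLM plan (assumed given).
    Returns a list of (label, (first_frame, last_frame)) pairs.
    """
    segments, start, moving = [], None, False
    for t in range(1, len(positions)):
        speed = math.dist(positions[t - 1], positions[t])  # distance per step
        if speed > pause_speed and not moving:
            start, moving = t - 1, True       # motion begins
        elif speed <= pause_speed and moving:
            segments.append((start, t - 1))   # motion ends at the pause
            moving = False
    if moving:
        segments.append((start, len(positions) - 1))
    if len(segments) != len(labels):
        raise ValueError("segment count does not match the plan")
    return list(zip(labels, segments))


demo = [(0, 0, 0), (0.1, 0, 0), (0.2, 0, 0), (0.2, 0, 0),
        (0.2, 0, -0.1), (0.2, 0, -0.2), (0.2, 0, -0.2)]
plan = ["move gripper to the bowl", "lower gripper"]
segments = segment_by_pauses(demo, plan)
print(segments)
# [('move gripper to the bowl', (0, 2)), ('lower gripper', (3, 5))]
```

The mismatch check reflects a general concern with any such pipeline: if the plan and the motion data disagree, it is safer to reject the demonstration than to train on mislabeled segments.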

FIG. 02. InSight pipeline: automated primitive extraction from demonstrations (Stage 1), then autonomous skill acquisition via flywheel that retrains the VLA (Stage 2). Source: Stanford InSight paper, arxiv.org/abs/2606.24884v1
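The Stage 2 loop can be sketched as a propose, roll out, filter, retrain cycle. Everything below is a toy under stated assumptions: the four callables (`propose`, `rollout`, `is_success`, `retrain`) are hypothetical stand-ins for the VLM, the robot, the success filter, and the VLA fine-tune, and the one-parameter "physics" is invented for illustration rather than taken from the paper.

```python
import random


def flywheel(propose, rollout, is_success, retrain, rounds=5, rollouts_per_round=20):
    """Collect successful autonomous rollouts, then hand them to a retraining step."""
    dataset = []
    for _ in range(rounds):
        params = propose(dataset)  # proposer sees the successes gathered so far
        trajectories = [rollout(params) for _ in range(rollouts_per_round)]
        dataset.extend(t for t in trajectories if is_success(t))
    if dataset:  # with no successes there is nothing to train on
        retrain(dataset)
    return dataset


# --- Toy stand-ins (all hypothetical) ---
rng = random.Random(0)
TARGET = 0.30  # distance the object should be pushed, in metres (toy value)


def propose(dataset):
    # Stand-in for the VLM: initial guess, then the mean of successful parameters.
    if not dataset:
        return 0.25
    return sum(t["applied"] for t in dataset) / len(dataset)


def rollout(center):
    applied = center + rng.gauss(0.0, 0.05)           # exploration noise
    return {"applied": applied, "achieved": applied}  # toy physics: push == command


def is_success(traj):
    return abs(traj["achieved"] - TARGET) < 0.03


retrained_on = []


def retrain(dataset):
    retrained_on.append(len(dataset))  # stand-in for fine-tuning the VLA


data = flywheel(propose, rollout, is_success, retrain)
print(f"{len(data)} successful rollouts kept for retraining")
```

Note that only filtered successes ever reach `retrain`, so the quality of the success check bounds the quality of the acquired primitive.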

The key distinction from VLM-as-planner systems is that InSight updates the policy. Most test-time composition approaches—SayCan, Code-as-Policies—extend behavior through inference-time reasoning without touching model weights. InSight writes the new skill back in, making it closer to continual learning than prompt engineering.

Evaluation covered five tasks in simulation and on real hardware: block flipping, drawer closing, sweeping, twisting, and pouring. All five were acquired without human demonstrations of the target skill. Once a primitive is learned, it composes with existing ones: a robot that learns "twist" and "pour" autonomously can then execute a combined pouring task with no additional data collection.
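A small sketch of what "composes with existing ones" means in practice: once a primitive has been written back into the policy, any later plan can reference it by label. The `SteerablePolicy` class is a hypothetical stand-in for a primitive-steerable VLA, and all labels except "pour the bottle" are made up for the example.

```python
class SteerablePolicy:
    """Toy stand-in for a VLA that accepts primitive-level language prompts."""

    def __init__(self, primitives):
        self.primitives = set(primitives)

    def add_primitive(self, label):
        # Stands in for retraining: the skill persists for all later tasks.
        self.primitives.add(label)

    def run(self, plan):
        """Execute a plan given as an ordered list of primitive labels."""
        missing = [p for p in plan if p not in self.primitives]
        if missing:
            raise KeyError(f"unlearned primitives: {missing}")
        return [f"executed: {p}" for p in plan]


policy = SteerablePolicy(["move gripper to the bottle", "grasp the bottle"])
policy.add_primitive("twist the cap")    # acquired autonomously in this story
policy.add_primitive("pour the bottle")  # likewise

log = policy.run([
    "move gripper to the bottle",
    "grasp the bottle",
    "twist the cap",
    "pour the bottle",
])
print(len(log))  # 4
```

The contrast with a planner-only system is that `add_primitive` changes the policy itself, so the new skill is available without re-deriving it at inference time.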

The abstract and introduction do not report aggregate success rates or latency numbers; full results are in the paper body. The quality of the primitive segmentation pipeline depends on the VLM's decomposition accuracy, so a noisy decomposition propagates into the steerability fine-tune. The autonomous rollout loop also needs the VLM-proposed controls to produce at least some successes to filter; a high initial failure rate will stall the flywheel.
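A back-of-the-envelope calculation shows why the initial success rate matters. The sketch assumes rollouts are independent with a fixed success probability, a simplification the paper does not claim; the numbers are illustrative only.

```python
def expected_rollouts(successes_needed, success_rate):
    """Expected number of rollouts to collect a given number of successes."""
    if not 0.0 < success_rate <= 1.0:
        raise ValueError("success_rate must be in (0, 1]")
    return successes_needed / success_rate


def prob_empty_batch(batch_size, success_rate):
    """Probability that a batch of rollouts contains no success at all."""
    return (1.0 - success_rate) ** batch_size


# Collecting 50 successful rollouts at a 25% success rate:
print(expected_rollouts(50, 0.25))            # 200.0
# At a 1% success rate, most batches of 20 rollouts yield nothing to filter:
print(round(prob_empty_batch(20, 0.01), 2))   # 0.82
```

Under these assumptions, dropping from a 25% to a 1% success rate multiplies the rollout budget by 25, which is one way the flywheel can stall on real hardware.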

For embodied AI architects: InSight expands a deployed VLA's skill set without returning humans to teleoperation for every new task. The tradeoff is running a live retraining pipeline on the robot, adding infrastructure overhead and raising questions about catastrophic forgetting of existing skills—neither fully solved in the paper. Code is available at insight-vla.github.io.

Written and edited by AI agents