DIRECT, a test-time compute router for embodied planners developed by Stanford University and NVIDIA researchers, has reduced planning latency by up to 65% on a physical Franka arm. The system achieved a 95% success rate on a multi-step grocery-bagging task in approximately seven seconds, compared to a monolithic model's 90.48% success at 19.58 seconds. The research challenges the assumption that scaling test-time compute uniformly improves embodied agents, instead treating planner selection as a dynamic inference problem based on scene difficulty.
The framework uses a lightweight vision-language router that takes a scene image and a natural-language instruction and selects from a fixed pool of high-level planners. The pool includes Qwen3.5-VL 9B in both thinking and non-thinking modes, models ranging from 2B to 235B parameters, and memory-augmented variants such as MemER and GroundSG. The router predicts a quality-cost tradeoff for each candidate and hands execution to a downstream vision-language-action (VLA) policy. It can route once per task or once per subgoal. The team validated it zero-shot on a Franka robot in the DROID setup, and the evaluation spans over 270,000 simulated routing decisions across the VLABench and RoboMME benchmarks plus 245 hardware trajectories.
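The selection step can be illustrated with a minimal sketch. It assumes the router emits a predicted success probability and a predicted latency for each planner, and that the two are combined into one score with a linear latency penalty. The scoring rule, the `latency_weight` value, the planner names, and the example predictions are all hypothetical; the article does not specify how DIRECT scores candidates.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PlannerEstimate:
    """Router output for one candidate planner (hypothetical structure)."""
    name: str
    success_prob: float  # predicted probability that the plan succeeds
    latency_s: float     # predicted planning latency in seconds


def select_planner(estimates, latency_weight=0.001):
    """Pick the candidate with the highest quality-minus-cost score.

    The linear score below is an assumption made for illustration,
    not the scoring rule used by DIRECT.
    """
    if not estimates:
        raise ValueError("planner pool is empty")
    return max(
        estimates,
        key=lambda e: e.success_prob - latency_weight * e.latency_s,
    )


# Hypothetical predictions for an easy scene and a hard scene.
easy_scene = [
    PlannerEstimate("non_thinking", success_prob=0.93, latency_s=1.9),
    PlannerEstimate("thinking", success_prob=0.95, latency_s=118.0),
]
hard_scene = [
    PlannerEstimate("non_thinking", success_prob=0.20, latency_s=1.9),
    PlannerEstimate("thinking", success_prob=0.90, latency_s=118.0),
]

print(select_planner(easy_scene).name)
print(select_planner(hard_scene).name)
```

With these made-up numbers the cheap planner wins on the easy scene, because the small gain in predicted success does not justify the extra latency. The thinking planner wins on the hard scene, where the gap in predicted success is large.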
Operationally, the gains are uneven. On 44% of VLABench tasks, the cheap non-thinking planner matched or exceeded the thinking model's success while consuming less than 2% of the latency: approximately 1.9 seconds versus 118 seconds, a 63-fold speedup. Model scaling is non-monotonic: a 32B planner can run slower than a 235B variant, and no single memory architecture dominates across task difficulties. On easy tasks, lightweight memory schemes outperform MemER at one-tenth the FLOPs, while MemER and GroundSG lead only on hard, long-recall problems. The implication is that raw FLOPs are a poor predictor of success, so compute has to be allocated per scene rather than scaled across the board.
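A back-of-the-envelope calculation shows what the figures above could imply for an idealized router. The sketch assumes a perfect oracle that sends exactly the 44% of tasks where the cheap planner suffices to that planner and everything else to the thinking planner. It also assumes the two approximate latencies apply uniformly to every task. Both assumptions are illustrative simplifications, and the result is not a number reported by the authors.

```python
# Approximate figures quoted in the article.
CHEAP_LATENCY_S = 1.9           # non-thinking planner
THINKING_LATENCY_S = 118.0      # thinking planner
CHEAP_SUFFICES_FRACTION = 0.44  # share of VLABench tasks where cheap is enough


def mean_latency(cheap_fraction, cheap_s, thinking_s):
    """Mean planning latency when cheap_fraction of tasks use the cheap planner."""
    if not 0.0 <= cheap_fraction <= 1.0:
        raise ValueError("cheap_fraction must be in [0, 1]")
    return cheap_fraction * cheap_s + (1.0 - cheap_fraction) * thinking_s


# Cheap planner latency as a fraction of thinking latency.
latency_ratio = CHEAP_LATENCY_S / THINKING_LATENCY_S

# Hypothetical oracle router: cheap planner only where it is known to suffice.
oracle_mean_s = mean_latency(
    CHEAP_SUFFICES_FRACTION, CHEAP_LATENCY_S, THINKING_LATENCY_S
)

# Saving relative to always running the thinking planner.
oracle_saving = 1.0 - oracle_mean_s / THINKING_LATENCY_S

print(f"cheap/thinking latency ratio: {latency_ratio:.1%}")
print(f"oracle mean latency: {oracle_mean_s:.1f} s")
print(f"saving vs always thinking: {oracle_saving:.0%}")
```

Under these assumptions the mean latency is dominated by the 56% of tasks that still need the thinking planner, so the aggregate saving is far smaller than the per-task speedup.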
The hardware results are concrete but narrow. The grocery-bagging task, which involves placing fruits from heaviest to lightest into a white bin, is multi-step and visually structured, and DIRECT's per-subgoal routing let it invoke the thinking planner only on the hard pick. However, the physical evaluation totals only 245 trajectories against over 270,000 simulated routing decisions, leaving open how the router generalizes to unseen environments, lighting shifts, or out-of-distribution geometries. A misprediction carries asymmetric risk: routing a hard step to a cheap planner risks task failure, while over-routing wastes the tokens and seconds the framework is meant to save. The authors do not report the router's own wall-clock overhead separately, though the aggregate latency reduction suggests it is small, likely well under a second.
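The asymmetric risk described above can be framed as an expected-cost decision per subgoal. This is a minimal sketch, not the authors' method: it assumes the router outputs a probability that a subgoal is hard, that a failed cheap plan costs a fixed number of seconds to recover from, and that unnecessary thinking costs the latency difference between planners. The function names, the 600-second failure cost, the 116-second overthinking cost, and the example difficulty values are all hypothetical.

```python
def should_think(p_hard, failure_cost_s, overthink_cost_s):
    """Return True when thinking has the lower expected cost.

    Routing cheap risks a failure with probability p_hard; routing to the
    thinking planner wastes time with probability (1 - p_hard).
    """
    if not 0.0 <= p_hard <= 1.0:
        raise ValueError("p_hard must be in [0, 1]")
    expected_loss_cheap = p_hard * failure_cost_s
    expected_loss_think = (1.0 - p_hard) * overthink_cost_s
    return expected_loss_cheap > expected_loss_think


def route_subgoals(p_hard_per_subgoal, failure_cost_s=600.0, overthink_cost_s=116.0):
    """Choose a planner for each subgoal independently (per-subgoal routing)."""
    return [
        "thinking" if should_think(p, failure_cost_s, overthink_cost_s)
        else "non_thinking"
        for p in p_hard_per_subgoal
    ]


# Hypothetical difficulty estimates for a three-step task with one hard pick.
plan = route_subgoals([0.05, 0.60, 0.10])
print(plan)
```

Because a failure is assumed to cost more than wasted thinking, the break-even probability sits below 0.5, so the rule errs toward the expensive planner when the router is unsure. Only the middle subgoal crosses that threshold in this example.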
Written and edited by AI agents