DIRECT cuts embodied AI planning latency by up to 65% with dynamic planner routing

DIRECT, a test-time compute router for embodied planners developed by researchers at Stanford University, the University of Waterloo, and NVIDIA, reduces planning latency by up to 65% on a physical Franka arm while matching or exceeding a stronger model's success rate. On a multi-step grocery-bagging task, the system reached 95% success in approximately seven seconds, compared with 90.48% success at 19.58 seconds for the Qwen3.5-VL 9B planner in thinking mode. The research challenges the assumption that uniformly scaling test-time compute improves embodied agents, and instead treats planner selection as an inference problem that depends on scene difficulty.

The framework uses a lightweight vision-language router that takes a scene image and a natural-language instruction and selects from a fixed pool of high-level planners. The pool includes Qwen3.5-VL 9B in both thinking and non-thinking modes, models ranging from 2B to 235B parameters, and memory-augmented variants such as MemER and GroundSG. The router predicts a quality-cost tradeoff for each candidate, and execution is delegated to a downstream VLA policy. Routing can run once per task or once per subgoal. The team validated the router zero-shot on a Franka robot in the DROID setup. In total, the evaluation covers more than 270,000 simulated routing decisions and 245 hardware trajectories, with VLABench and RoboMME as benchmarks.
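The selection step can be pictured with a minimal Python sketch. It is not the authors' implementation: the `PlannerEstimate` fields, the linear score (predicted success minus a weighted latency), the `latency_weight` value, and the example numbers are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PlannerEstimate:
    """Router output for one candidate planner (all fields are hypothetical)."""
    name: str
    predicted_success: float    # estimated probability of success, in [0, 1]
    predicted_latency_s: float  # estimated planning latency in seconds


def select_planner(estimates, latency_weight=0.02):
    """Return the candidate with the best success-minus-weighted-latency score."""
    if not estimates:
        raise ValueError("need at least one candidate planner")
    return max(
        estimates,
        key=lambda e: e.predicted_success - latency_weight * e.predicted_latency_s,
    )


def route_per_subgoal(subgoal_estimates, latency_weight=0.02):
    """Route each subgoal independently; return chosen names and summed latency."""
    chosen = [select_planner(c, latency_weight) for c in subgoal_estimates]
    total_latency_s = sum(p.predicted_latency_s for p in chosen)
    return [p.name for p in chosen], total_latency_s


# Illustrative estimates only: one easy pick and one hard pick.
easy = [
    PlannerEstimate("no-thinking", 0.95, 2.0),
    PlannerEstimate("thinking", 0.97, 20.0),
]
hard = [
    PlannerEstimate("no-thinking", 0.40, 2.0),
    PlannerEstimate("thinking", 0.90, 20.0),
]

names, total_latency = route_per_subgoal([easy, hard])
print(names, total_latency)
```

With these made-up estimates the easy pick goes to the non-thinking planner and the hard pick to the thinking planner. Raising `latency_weight` makes the sketch more reluctant to pay for thinking, which is the knob a quality-cost tradeoff implies.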

The gains are uneven across tasks. On 44% of VLABench tasks, the cheap non-thinking planner matched or exceeded the thinking model's success at less than 2% of the latency: about 1.9 seconds versus 118 seconds, roughly 63 times faster. Model scaling from 2B to 235B is non-monotonic: a 32B planner can run slower than a 235B variant. No single memory architecture dominates across task difficulties either. On easy tasks, a lightweight memory scheme outperforms MemER at roughly one-tenth the FLOPs, while MemER and GroundSG lead only on hard, long-recall tasks. The implication is that the amount of compute spent is a poor predictor of success unless it is matched to the scene.
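The VLABench figures can be checked with a few lines of Python. The inputs are the rounded values quoted above, so the ratio comes out near 62×, in line with the reported 63×. The blended-latency estimate is a back-of-envelope assumption (a hypothetical perfect router with zero overhead), not a result from the paper.

```python
# Rounded VLABench figures quoted in the article.
fast_latency_s = 1.9    # non-thinking planner
slow_latency_s = 118.0  # thinking planner
easy_fraction = 0.44    # share of tasks where the cheap planner matches or beats thinking

speedup = slow_latency_s / fast_latency_s
latency_share = fast_latency_s / slow_latency_s

# Hypothetical oracle router: cheap planner on the 44%, thinking planner on the rest.
# Router overhead is ignored, so this is an idealized estimate, not a reported result.
blended_latency_s = (
    easy_fraction * fast_latency_s + (1 - easy_fraction) * slow_latency_s
)
reduction = 1 - blended_latency_s / slow_latency_s

print(f"speedup: {speedup:.1f}x, latency share: {latency_share:.1%}")
print(f"oracle blended latency: {blended_latency_s:.1f} s ({reduction:.1%} lower)")
```

Under these assumptions, routing only the 44% of tasks where thinking adds nothing would cut average latency by roughly 43% relative to always thinking.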

The hardware results are concrete but narrow. The grocery-bagging task, which requires placing fruits into a white bin from heaviest to lightest, is multi-step and visually structured, and per-subgoal routing let DIRECT invoke the thinking planner only on the hard pick. However, the physical evaluation comprises only 245 trajectories, against more than 270,000 simulated routing decisions, which leaves open how the router generalizes to unseen environments, lighting shifts, or out-of-distribution geometries. Mispredictions carry asymmetric risk: routing a hard step to a cheap planner can cause the task to fail, while routing an easy step to an expensive planner wastes the tokens and seconds the framework is meant to save. The authors do not report the router's own wall-clock overhead separately. The aggregate latency reduction suggests it is small relative to planner latency, but that is an inference rather than a measured figure.
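One way to reason about that asymmetry is a simple expected-cost comparison, sketched below in Python. This is a generic cost-sensitive decision rule, not something the paper describes. The failure penalty expressed in seconds is a hypothetical quantity, and the task-level grocery-bagging figures stand in for per-step estimates, which is an assumption.

```python
def expected_cost_s(latency_s, success_prob, failure_penalty_s):
    """Planning latency plus the failure penalty weighted by failure probability."""
    return latency_s + failure_penalty_s * (1.0 - success_prob)


def break_even_penalty_s(cheap, costly):
    """Failure penalty (in seconds) at which both planners have equal expected cost.

    Each argument is a (latency_s, success_prob) pair. The costly planner must be
    both slower and more successful for a break-even point to exist.
    """
    (lat_c, p_c), (lat_k, p_k) = cheap, costly
    if lat_k <= lat_c or p_k <= p_c:
        raise ValueError("costly planner must be slower and more successful")
    return (lat_k - lat_c) / (p_k - p_c)


# Task-level grocery-bagging figures from the article, used here as stand-ins
# for per-step estimates (an assumption).
no_thinking = (2.19, 0.4762)
thinking = (19.58, 0.9048)

penalty = break_even_penalty_s(no_thinking, thinking)
print(f"break-even failure penalty: {penalty:.1f} s")
```

If a failed step costs more than the break-even penalty (for example, because the whole task must be restarted), escalating to the thinking planner has the lower expected cost. Below that penalty, the cheap planner does.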

FIG. 02 Grocery-bagging task: the non-thinking Qwen3.5-VL 9B planner is fast (2.2 s) but reaches only 47.6% success, while the same model in thinking mode reaches 90.5% success at about 9× the latency (19.6 s). Source: Qwen3.5-VL results, DIRECT paper

Sources

DIRECT reduces planning latency by up to 65% on a physical Franka arm while matching or exceeding a stronger model's success rate
"our router matches or exceeds a stronger model's success rate at up to 65% lower average latency"
arxiv.org ↗
DIRECT achieves 95% success on multi-step grocery-bagging at approximately 7 seconds; Thinking planner achieves 90% at ~20 seconds
"On multi-step grocery bagging it reaches 95% success at 7 seconds—versus the Thinking planner's 90% at 20 seconds."
jadee-dao.github.io ↗
On 44% of VLABench tasks, the non-thinking model matches or beats the Thinking model at ~1.9s vs 118s (63× faster)
"On 44% of VLABench tasks, the non-thinking model matches or beats Thinking at <2% of the latency — about 63× faster (1.9 s vs 118 s)."
jadee-dao.github.io ↗
Model size scaling 2B→235B is non-monotonic; a 32B planner can run slower than 235B
"Scaling 2B→235B is non-monotonic — a 32B planner can even run slower than 235B."
jadee-dao.github.io ↗
Lightweight memory schemes beat MemER at ~10× fewer FLOPs on easy tasks; MemER and GroundSG lead on hard long-recall tasks
"on easy tasks a lightweight scheme beats MemER at ~10× fewer FLOPs, while MemER and GroundSG lead on hard, long-recall tasks"
jadee-dao.github.io ↗
DIRECT validated across 270,000+ simulated routing decisions and 245 hardware trajectories on Franka DROID
"Across 270,000+ simulated routing decisions and 245 hardware trajectories, spanning all three test-time-compute axes."
jadee-dao.github.io ↗
Qwen3.5-VL 9B (No Thinking) achieves 47.62% success at 2.19s; (Thinking) achieves 90.48% success at 19.58s on grocery-bagging
"Qwen3.5-VL 9B (No Thinking) 47.62 2.19 | Qwen3.5-VL 9B (Thinking) 90.48 19.58"
jadee-dao.github.io ↗
Paper authors affiliated with Stanford University (affiliation 1) and NVIDIA (affiliation 3); Marco Pavone carries both
"1Stanford University, 2University of Waterloo, 3NVIDIA"
arxiv.org ↗

Written and edited by AI agents · Methodology
