AHA-WAM achieves 4.59× faster robot control by decoupling Diffusion Transformers

AHA-WAM, a dual Diffusion Transformer (DiT) architecture, reaches a closed-loop robot control rate of 24.17 Hz, 92.80% average success on the RoboTwin simulation benchmark, and 78.3% success across four real-world manipulation tasks, all without any robot-data pretraining. The arXiv paper argues that existing world-action models (WAMs) are inefficient because they force world prediction and action execution to share the same clock frequency, and it proposes a fix at the architecture level.

The stack pairs two DiTs. A low-frequency video DiT handles world planning: it maintains a rolling key-value (KV) memory over past observations and exposes layerwise latent context that encodes long-horizon scene evolution. A high-frequency action DiT executes short action chunks in closed loop, querying that stored context through layerwise joint attention. To keep stale context from hurting reactivity, the authors introduce two mechanisms. Observation-Guided Video-Context Routing (OVCR) lets the action branch ingest fresh observations without a full video DiT forward pass, and horizon-adaptive offset training teaches the action DiT to tolerate variable delays between world updates. The authors report that this structural decoupling makes AHA-WAM 4.59× faster than Fast-WAM, an earlier real-time WAM that runs at 190 ms latency, or roughly 5.26 Hz.
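The two-clock idea can be illustrated with a toy loop. The following is a minimal sketch, not the paper's implementation: `WorldPlanner`, `action_step`, `run_loop`, and the `planner_every` parameter are hypothetical names, observations are plain floats, and the age-weighted blend is only a stand-in for what OVCR and layerwise joint attention do in the real model.

```python
from collections import deque


class WorldPlanner:
    """Hypothetical stand-in for the low-frequency video DiT."""

    def __init__(self, memory_len: int = 4):
        # Bounded deque plays the role of the rolling KV memory.
        self.memory = deque(maxlen=memory_len)

    def update(self, observation: float) -> float:
        self.memory.append(observation)
        # Toy "context": the mean of the remembered observations.
        # A real model would expose layerwise latent features instead.
        return sum(self.memory) / len(self.memory)


def action_step(context: float, fresh_obs: float, offset: int) -> float:
    """Hypothetical stand-in for the high-frequency action DiT."""
    # Trust the cached context less as it ages (offset = ticks since refresh).
    weight = 1.0 / (1 + offset)
    return weight * context + (1.0 - weight) * fresh_obs


def run_loop(observations, planner_every: int = 4):
    """Run the action step every tick and the planner every `planner_every` ticks."""
    planner = WorldPlanner()
    context, offset, planner_calls = 0.0, 0, 0
    actions = []
    for tick, obs in enumerate(observations):
        if tick % planner_every == 0:
            context = planner.update(obs)  # slow path: refresh world context
            offset = 0
            planner_calls += 1
        actions.append(action_step(context, obs, offset))  # fast path
        offset += 1
    return actions, planner_calls


actions, planner_calls = run_loop([float(i) for i in range(8)], planner_every=4)
print(len(actions), planner_calls)
```

In this toy run the action step fires on all eight ticks while the planner runs only twice, which is the kind of saving the decoupling is meant to deliver; the actual ratio between the two clocks in AHA-WAM is not stated in the summary above.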

Competing approaches are slower or more data-hungry. A naive single-GPU implementation of DreamZero's 14B WAM takes about 5.7 seconds per action chunk, and it reaches roughly 7 Hz only after a reported 38× inference speedup from Flash-optimized asynchronous execution. X-WAM, which relies on Asynchronous Noise Sampling to decode actions in fewer steps, scores 90.7% on RoboTwin 2.0 but does not report per-chunk latency and requires pretraining on over 5,800 hours of robotic data. AHA-WAM's 24.17 Hz translates to approximately 41 ms per action chunk, achieved without any robot-data pretraining.
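The rate and latency figures quoted here can be cross-checked with simple arithmetic. The sketch below assumes that one action chunk is produced per control step, so rate is just the reciprocal of latency; the helper names are illustrative and the input numbers are the ones reported in the cited abstracts.

```python
def hz_from_latency_ms(latency_ms: float) -> float:
    """Convert per-step latency in milliseconds to a rate in Hz."""
    return 1000.0 / latency_ms


def latency_ms_from_hz(rate_hz: float) -> float:
    """Convert a rate in Hz to per-step latency in milliseconds."""
    return 1000.0 / rate_hz


fast_wam_hz = hz_from_latency_ms(190.0)  # Fast-WAM: 190 ms latency
aha_wam_ms = latency_ms_from_hz(24.17)   # AHA-WAM: 24.17 Hz
speedup = 24.17 / fast_wam_hz            # AHA-WAM rate over Fast-WAM rate
dreamzero_hz = 38.0 / 5.7                # 5.7 s per chunk after a 38x speedup

print(f"Fast-WAM:  {fast_wam_hz:.2f} Hz")
print(f"AHA-WAM:   {aha_wam_ms:.1f} ms per chunk")
print(f"Speedup:   {speedup:.2f}x")
print(f"DreamZero: {dreamzero_hz:.2f} Hz")
```

Under that assumption the numbers are mutually consistent: 190 ms gives about 5.26 Hz, 24.17 Hz gives about 41.4 ms, their ratio is about 4.59, and DreamZero's 5.7 seconds divided by 38 lands near the "approximately 7 Hz" it reports.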

However, real-world validation is limited to four manipulation tasks, which makes the 78.3% success rate a weak indicator of sim-to-real transfer. The paper does not provide hardware specifications, GPU memory footprints, or the serving cost of keeping two DiTs resident while managing rolling KV cache state and OVCR routing logic in the control loop. A production stack would have to run two independent temporal pipelines, the world planner and the action executor, which introduces potential jitter, synchronization failure modes, and memory pressure that monolithic WAMs do not face. It is also unclear how the KV memory behaves during long-horizon episodes spanning minutes, or whether context drift accumulates without periodic planner refreshes.

Sources

AHA-WAM achieves 92.80% average success on RoboTwin, 78.3% across 4 real-world tasks, 24.17 Hz closed-loop control, 4.59× speedup over Fast-WAM, without robot-data pretraining
"AHA-WAM achieves state-of-the-art performance without any robot-data pretraining, attaining 92.80% average success on RoboTwin and 78.3% success across 4 real-world tasks, while reaching 24.17 Hz closed-loop control with a 4.59x speedup over Fast-WAM."
arxiv.org ↗
AHA-WAM uses a dual DiT: a low-frequency video DiT maintains rolling KV memory; a high-frequency action DiT queries it via layerwise joint attention
"AHA-WAM instantiates the video DiT as a low-frequency world planner that maintains rolling key-value memory over past observations and exposes reusable layerwise latent context encoding long-horizon scene evolution, while a high-frequency action DiT executes short action chunks in closed loop by querying this context through layerwise joint attention."
arxiv.org ↗
OVCR and horizon-adaptive offset training let the action DiT ingest fresh observations without re-running the video DiT
"we introduce horizon-adaptive offset training and Observation-Guided Video-Context Routing (OVCR), which together let the action expert exploit long-horizon world context while remaining responsive to real-time execution state without rerunning the video DiT."
arxiv.org ↗
Fast-WAM runs at 190 ms latency (derived: ~5.26 Hz)
"Fast-WAM achieves competitive results with state-of-the-art methods both on simulation benchmarks (LIBERO and RoboTwin) and real-world tasks, without embodied pretraining. It runs in real time with 190 ms latency, over 4× faster than existing imagine-then-execute WAMs."
arxiv.org ↗
Fast-WAM's value of video prediction lies in training-time world representations, not test-time future imagination
"These results suggest that the main value of video prediction in WAMs may lie in improving world representations during training rather than generating future observations at test time."
arxiv.org ↗
X-WAM scores 90.7% on RoboTwin 2.0 and requires pretraining on over 5,800 hours of robotic data
"Pretrained on over 5,800 hours of robotic data, X-WAM achieves 79.2% and 90.7% average success rate on RoboCasa and RoboTwin 2.0 benchmarks"
arxiv.org ↗
DreamZero's 14B WAM requires 5.7 seconds per action chunk in naive implementation
"A naive implementation of DreamZero on a single GPU requires approximately 5.7 seconds per action chunk due to three bottlenecks: (1) iterative denoising across 16 diffusion steps required for smooth actions, (2) the computational cost of a 14B parameter DiT backbone, and (3) sequential execution that blocks robot motion during inference."
arxiv.org ↗
DreamZero achieves ~7 Hz with Flash-optimized asynchronous execution
"these techniques achieve a 38× inference speedup without degrading performance, enabling DreamZero to generate action chunks at approximately 7Hz for smooth, real-time robotic control"
arxiv.org ↗

Written and edited by AI agents · Methodology
