AHA-WAM, a dual Diffusion Transformer (DiT) architecture, reaches a closed-loop robot control rate of 24.17 Hz, with 92.80% average success on the RoboTwin simulation benchmark and 78.3% across four real-world manipulation tasks, all without any robot-data pretraining. The arXiv paper argues that existing world-action models (WAMs) are inefficient because they force world prediction and action execution to run at the same clock frequency, and it proposes a fix at the architecture level.

The stack comprises two DiTs. A low-frequency video DiT handles world planning: it maintains a rolling key-value memory over past observations and provides layerwise latent context for long-horizon scene evolution. A high-frequency action DiT executes short action chunks in closed loop, querying that stored context through layerwise joint attention. To keep stale context from hurting reactivity, the authors introduce Observation-Guided Video-Context Routing (OVCR), which lets the action branch ingest fresh observations without a full video DiT forward pass, and horizon-adaptive offset training, which teaches the action DiT to tolerate variable delays between world updates. This structural decoupling makes AHA-WAM 4.59× faster than Fast-WAM, the previous state of the art, which ran at 190 ms latency (roughly 5.26 Hz).
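The two-clock idea can be illustrated with a toy control loop. The sketch below is not the paper's implementation: `RollingContext`, `slow_planner`, `fast_actor`, and `run_loop` are hypothetical names, the "models" are placeholders that return dictionaries, and the fixed refresh interval is an assumption made for clarity. It only shows a slow planner refreshing a bounded memory while a fast actor reads the latest cached context at every step.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class RollingContext:
    """Hypothetical stand-in for the video DiT's rolling key-value memory."""
    max_entries: int = 4
    entries: deque = field(default_factory=deque)

    def push(self, latent):
        # Keep only the most recent `max_entries` planner outputs.
        self.entries.append(latent)
        while len(self.entries) > self.max_entries:
            self.entries.popleft()


def slow_planner(observation):
    """Placeholder for a low-frequency world-planning pass (not a real video DiT)."""
    return {"planned_from": observation}


def fast_actor(observation, context, offset):
    """Placeholder for a high-frequency action pass that reads cached context.

    `offset` counts action steps since the last planner update, i.e. how
    stale the cached context is when the action is produced.
    """
    latest = context.entries[-1]["planned_from"]
    return {"obs": observation, "context_from": latest, "offset": offset}


def run_loop(num_steps, planner_every):
    """Run `num_steps` action steps, refreshing the planner every `planner_every` steps."""
    context = RollingContext()
    actions = []
    for step in range(num_steps):
        observation = step  # toy observation: just the step index
        if step % planner_every == 0:
            context.push(slow_planner(observation))
        offset = step % planner_every
        actions.append(fast_actor(observation, context, offset))
    return actions


actions = run_loop(num_steps=6, planner_every=3)
for action in actions:
    print(action)
```

In this toy run the actor always sees the current observation, while the planner context it reads is up to two steps old. That gap is the kind of delay that OVCR and horizon-adaptive offset training are described as addressing.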

By comparison, DreamZero's 14B WAM takes 5.7 seconds per action chunk and reaches only about 7 Hz after Flash-optimized asynchronous execution. X-WAM, which relies on Asynchronous Noise Sampling to decode actions in fewer steps, scores 90.7% on RoboTwin 2.0 but does not report per-chunk latency and depends on pretraining with over 5,800 hours of robot data. AHA-WAM's 24.17 Hz corresponds to approximately 41 ms per action chunk, achieved with no pretraining beyond task-specific demonstrations.
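The rate and latency figures quoted for AHA-WAM and Fast-WAM are reciprocals of one another, so they can be cross-checked with a few lines of arithmetic. The helper names below are illustrative only, and the calculation assumes one action chunk per control cycle.

```python
def hz_to_ms(rate_hz):
    """Convert a control rate in Hz to per-cycle latency in milliseconds."""
    return 1000.0 / rate_hz


def ms_to_hz(latency_ms):
    """Convert per-cycle latency in milliseconds to a rate in Hz."""
    return 1000.0 / latency_ms


aha_wam_ms = hz_to_ms(24.17)     # about 41.4 ms per action chunk
fast_wam_hz = ms_to_hz(190.0)    # about 5.26 Hz
speedup = 24.17 / fast_wam_hz    # about 4.59x

print(f"{aha_wam_ms:.1f} ms, {fast_wam_hz:.2f} Hz, {speedup:.2f}x")
```

The three reported numbers (24.17 Hz, 190 ms, 4.59×) are mutually consistent under that assumption.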

However, real-world validation is limited to four manipulation tasks, which makes the 78.3% success rate weak evidence of sim-to-real transfer. The paper does not report hardware specifications, GPU memory footprints, or the serving cost of keeping two DiTs resident while managing rolling KV cache state and OVCR routing logic inside the control loop. Production stacks would have to run two independent temporal pipelines (world planner and action executor), introducing potential jitter, synchronization failure modes, and memory pressure that monolithic WAMs do not face. It also remains unclear how the KV memory behaves over long-horizon episodes spanning minutes, or whether context drift accumulates without periodic planner refreshes.
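Jitter of the kind flagged here is straightforward to quantify once loop timestamps are logged. The following is a minimal sketch, assuming tick times are recorded in seconds; `loop_jitter_ms` is a hypothetical helper and the timestamps are made-up illustrative values, not measurements from the paper.

```python
import statistics


def loop_jitter_ms(tick_times_s):
    """Return (mean period, worst absolute deviation from the mean), both in ms."""
    periods = [(b - a) * 1000.0 for a, b in zip(tick_times_s, tick_times_s[1:])]
    mean_period = statistics.fmean(periods)
    worst = max(abs(p - mean_period) for p in periods)
    return mean_period, worst


# Illustrative timestamps: a loop targeting roughly 41 ms with one slow cycle.
ticks = [0.000, 0.041, 0.083, 0.124, 0.170]
mean_ms, worst_ms = loop_jitter_ms(ticks)
print(f"mean period {mean_ms:.1f} ms, worst deviation {worst_ms:.1f} ms")
```

A real deployment would track this separately for the planner and the action executor, since the two pipelines run on different clocks.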

Written and edited by AI agents