Ai2 releases MolmoMotion, a 3D motion forecaster for a robot stack with 180-millisecond action calls

The Allen Institute for AI (Ai2) shipped MolmoMotion, a language-guided 3D motion forecasting model, on June 17, 2026. Given a video frame, 3D query points on an object, and a natural-language action like "Move and rotate the wooden bowl with fruit on the table," the model predicts where those points will move in 3D space over the next few seconds. The accompanying MolmoMotion-1M dataset draws on 1.16M videos and pairs object-grounded trajectory annotations with action descriptions. PointMotionBench, a human-validated benchmark of 2.7K video clips, measures forecasting accuracy.

The motion representation uses sparse surface points in world-frame 3D space. Ai2 chose this format for three reasons. It is class-agnostic: the same points describe rigid, articulated, and (within limits) deformable motion, with no fixed templates for hands or rigid bodies. It is view-stable: because the points live in a shared world frame, their trajectories stay consistent under camera motion and viewpoint change. And it is compact: the trajectories can be passed to robot policies or video generation models without full rendering.
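To make the representation concrete, here is a minimal sketch of what a trajectory over sparse world-frame points could look like. The type names, the `[T][N]` nesting, and the `move_and_rotate` helper are assumptions made for illustration and are not MolmoMotion's actual data format; the helper simply synthesizes a "move and rotate" motion by hand rather than predicting one.

```python
import math
from typing import List, Tuple

Point3D = Tuple[float, float, float]
Trajectory = List[List[Point3D]]  # hypothetical layout: [T steps][N points] -> (x, y, z)


def move_and_rotate(points: List[Point3D], steps: int,
                    total_shift: Point3D, total_yaw_rad: float) -> Trajectory:
    """Synthesize a motion: rotate about the vertical axis through the
    points' centroid while translating, interpolated linearly over `steps`."""
    n = len(points)
    cx = sum(p[0] for p in points) / n
    cy = sum(p[1] for p in points) / n
    trajectory: Trajectory = []
    for t in range(1, steps + 1):
        frac = t / steps
        yaw = total_yaw_rad * frac
        c, s = math.cos(yaw), math.sin(yaw)
        frame = []
        for x, y, z in points:
            dx, dy = x - cx, y - cy
            frame.append((
                cx + c * dx - s * dy + total_shift[0] * frac,
                cy + s * dx + c * dy + total_shift[1] * frac,
                z + total_shift[2] * frac,
            ))
        trajectory.append(frame)
    return trajectory


# Four query points on the rim of an object, in metres, in a shared world frame.
query_points = [(0.1, 0.0, 0.8), (-0.1, 0.0, 0.8), (0.0, 0.1, 0.8), (0.0, -0.1, 0.8)]
trajectory = move_and_rotate(query_points, steps=5,
                             total_shift=(0.3, 0.0, 0.0),
                             total_yaw_rad=math.pi / 2)
```

Nothing in the structure says what kind of object the points belong to, which is the sense in which such a format is class-agnostic.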

MolmoMotion runs on a Molmo 2 backbone, which connects language instructions to specific objects and query points in the image. It ships in two variants. MolmoMotion-AR (autoregressive) represents 3D coordinates as quantized, structured text and predicts future positions step by step, which produces smooth rollouts and the highest accuracy on well-defined paths. MolmoMotion-FM (flow-matching) works in continuous 3D space by transforming noise into motion, so it can represent distributional uncertainty instead of collapsing to a single path.
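The two output styles can be sketched in a few lines. Everything below is an illustrative assumption rather than MolmoMotion's implementation: the bin count, coordinate range, and space-separated text format are invented for the example, and `toy_velocity` is a closed-form stand-in for a learned velocity network that pulls every sample toward one fixed target.

```python
import random
from typing import Tuple

Point3D = Tuple[float, float, float]

# --- AR-style: coordinates as quantized text (range and bin count are assumptions) ---
LO, HI, BINS = -1.0, 1.0, 1000


def quantize(v: float) -> int:
    """Clamp to [LO, HI] and map to an integer bin index in [0, BINS - 1]."""
    v = min(max(v, LO), HI)
    return round((v - LO) / (HI - LO) * (BINS - 1))


def dequantize(q: int) -> float:
    """Map a bin index back to the coordinate it represents."""
    return LO + q / (BINS - 1) * (HI - LO)


def encode_point(p: Point3D) -> str:
    """Serialize a 3D point as three space-separated bin indices."""
    return " ".join(str(quantize(v)) for v in p)


def decode_point(text: str) -> Point3D:
    x, y, z = (dequantize(int(tok)) for tok in text.split())
    return (x, y, z)


# --- FM-style: integrate a velocity field from noise (t=0) to a sample (t=1) ---
def toy_velocity(x: Point3D, t: float, target: Point3D) -> Point3D:
    """Stand-in for a learned network: the exact velocity field for
    straight-line paths that all end at a single fixed target."""
    return tuple((g - xi) / (1.0 - t) for xi, g in zip(x, target))


def sample_flow(target: Point3D, steps: int = 10, seed: int = 0) -> Point3D:
    """Euler-integrate the velocity field starting from Gaussian noise."""
    rng = random.Random(seed)
    x = tuple(rng.gauss(0.0, 1.0) for _ in range(3))
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt  # t stays below 1, so (1 - t) is never zero
        v = toy_velocity(x, t, target)
        x = tuple(xi + dt * vi for xi, vi in zip(x, v))
    return x


text = encode_point((0.25, -0.5, 0.8))
recovered = decode_point(text)
sampled = sample_flow(target=(0.3, 0.1, 0.8))
```

The quantized round trip loses at most half a bin width per coordinate, which is the price of treating positions as text tokens. In the toy sampler every noise draw ends at the same target; a field learned over many plausible motions would instead map different noise draws to different trajectories.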

On PointMotionBench, MolmoMotion outperforms every baseline Ai2 tested: pixel-space video generators, parametric 3D methods, and a simple constant-velocity baseline. Evaluation covered forecasting accuracy, downstream robot task success, and controllable video generation quality. The learned motion priors transfer across application domains without per-task tuning.
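For a sense of what the simplest baseline does, the sketch below extrapolates each point along its most recent displacement and scores the result with a mean point-wise distance. Two assumptions are worth flagging: the baseline here estimates velocity from two observed positions per point, and the metric (average displacement error) is a common choice for trajectory forecasting, not necessarily the one PointMotionBench uses.

```python
import math
from typing import List, Tuple

Point3D = Tuple[float, float, float]
Trajectory = List[List[Point3D]]  # [T steps][N points]


def constant_velocity_forecast(prev: List[Point3D], curr: List[Point3D],
                               horizon: int) -> Trajectory:
    """Extrapolate each point linearly: p_t = curr + t * (curr - prev)."""
    trajectory: Trajectory = []
    for t in range(1, horizon + 1):
        frame = [
            tuple(c + t * (c - p) for p, c in zip(prev_pt, curr_pt))
            for prev_pt, curr_pt in zip(prev, curr)
        ]
        trajectory.append(frame)
    return trajectory


def average_displacement_error(pred: Trajectory, truth: Trajectory) -> float:
    """Mean Euclidean distance between predicted and true points,
    averaged over all time steps and all points."""
    total, count = 0.0, 0
    for pred_frame, true_frame in zip(pred, truth):
        for p, q in zip(pred_frame, true_frame):
            total += math.dist(p, q)
            count += 1
    return total / count


prev_points = [(0.0, 0.0, 0.0)]
curr_points = [(0.1, 0.0, 0.0)]
forecast = constant_velocity_forecast(prev_points, curr_points, horizon=3)

# A hypothetical true motion that drifts sideways by 0.1 m, which the baseline cannot anticipate.
ground_truth = [[(0.2, 0.1, 0.0)], [(0.3, 0.1, 0.0)], [(0.4, 0.1, 0.0)]]
error = average_displacement_error(forecast, ground_truth)
```

A baseline like this has no access to the instruction, so it cannot tell "push the bowl" from "rotate the bowl" when both start with the same displacement.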

Teams can reach the stack through Hugging Face's LeRobot platform. MolmoAct 2, Ai2's vision-language-action (VLA) policy released in May 2026, already has inference and training integrated into LeRobot. Architects can add MolmoMotion as the upstream forecasting stage without redeploying. MolmoAct 2 has been downloaded more than 400K times since launch, runs up to 37x faster than its predecessor, and outperforms capable proprietary robotics models on industry benchmarks. The combined stack trains on SO-100 and SO-101 arms, which cost under $500 per unit and are already part of LeRobot's ecosystem. LeRobot, published in February 2026, decouples action planning from execution, allowing policy inference to run on a separate machine in parallel with the robot's low-level control loop.
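The decoupling idea can be illustrated with a producer and a consumer that share a queue. This is a single-process sketch using Python's standard library, not LeRobot's API: the function names are hypothetical, a thread stands in for the separate inference machine, and `time.sleep` stands in for model inference and motor commands.

```python
import queue
import threading
import time


def policy_worker(action_queue: queue.Queue, n_chunks: int,
                  chunk_size: int, infer_seconds: float) -> None:
    """Planning side: produce chunks of actions, one slow inference call per chunk."""
    for chunk in range(n_chunks):
        time.sleep(infer_seconds)  # stand-in for a policy inference call
        for step in range(chunk_size):
            action_queue.put(("chunk", chunk, step))
    action_queue.put(None)  # sentinel: no more actions


def control_loop(action_queue: queue.Queue, tick_seconds: float):
    """Execution side: tick at a fixed rate and never block on inference."""
    executed, idle_ticks = [], 0
    while True:
        try:
            action = action_queue.get_nowait()
        except queue.Empty:
            idle_ticks += 1  # nothing planned yet: hold position this tick
            time.sleep(tick_seconds)
            continue
        if action is None:
            break
        executed.append(action)  # stand-in for sending a motor command
        time.sleep(tick_seconds)
    return executed, idle_ticks


def run_demo():
    actions: queue.Queue = queue.Queue()
    worker = threading.Thread(target=policy_worker, args=(actions, 3, 4, 0.02))
    worker.start()
    executed, idle_ticks = control_loop(actions, tick_seconds=0.002)
    worker.join()
    return executed, idle_ticks


if __name__ == "__main__":
    executed, idle_ticks = run_demo()
    print(f"executed {len(executed)} actions, idled for {idle_ticks} ticks")
```

The number of idle ticks depends on timing, but the control loop keeps its cadence either way, which is the point of separating the two loops.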

One latency figure matters for production: a single MolmoAct 2 action call completes in ~180 ms on one H100 in the LIBERO benchmark environment, and the MolmoAct 2-Think variant with adaptive depth reasoning raises that to ~790 ms. Both are far below the predecessor's 6,700 ms, which caused visible pauses between movements. MolmoMotion runs upstream of execution: it generates the motion path before the action loop starts, not during it, so its inference cost does not add per-step latency.
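The "up to 37x" figure follows directly from these numbers, as a quick calculation shows. The calls-per-second values below are an added back-of-the-envelope derivation that assumes one blocking action call at a time; they are not figures reported by Ai2.

```python
# Reported per-call latencies in milliseconds (one H100, LIBERO environment).
LATENCY_MS = {
    "MolmoAct (predecessor)": 6700,
    "MolmoAct 2 (base)": 180,
    "MolmoAct 2-Think": 790,
}


def speedup(old_ms: float, new_ms: float) -> float:
    """How many times faster the new latency is than the old one."""
    return old_ms / new_ms


def max_calls_per_second(latency_ms: float) -> float:
    """Upper bound on sequential action calls per second at this latency."""
    return 1000.0 / latency_ms


baseline = LATENCY_MS["MolmoAct (predecessor)"]
for name, ms in LATENCY_MS.items():
    print(f"{name}: {ms} ms, "
          f"{speedup(baseline, ms):.1f}x vs predecessor, "
          f"{max_calls_per_second(ms):.2f} calls/s")
```

By this arithmetic the base model is about 37.2x faster than the predecessor and the reasoning variant about 8.5x faster.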

FIG. 02: MolmoAct 2 latency on the LIBERO benchmark (H100); base model up to 37× faster than predecessor. Source: Ai2 blog, 2026

Ai2 and Hugging Face offer a complete open-weights pipeline: MolmoMotion for language-conditioned 3D trajectory prediction, MolmoAct 2 for VLA execution, and LeRobot for training and deployment. It runs on commodity arms and integrates into existing LeRobot setups without custom work.

Sources

MolmoMotion predicts where 3D query points on an object will move over the next few seconds given a video frame and a natural-language action description
"Given a video frame, 3D points marked on an object, and written instructions describing the intended action (e.g., 'Move and rotate the wooden bowl with fruit on the table'), MolmoMotion predicts where those points will move over the next few seconds in 3D space"
huggingface.co ↗
MolmoMotion-1M is drawn from 1.16M videos; PointMotionBench contains 2.7K human-validated video clips
"MolmoMotion-1M, the largest collection of 3D point trajectories paired with action descriptions, drawn from 1.16M videos. We're also releasing PointMotionBench, a human-validated benchmark designed to measure object-centric 3D motion forecasting accuracy, containing 2.7K video clips."
huggingface.co ↗
The sparse 3D surface-point representation is class-agnostic, view-stable, and directly passable to downstream systems
"A sparse set of surface points can describe rigid, articulated, and (within limits) deformable motion without assuming the type of object being moved. Because the points live in a shared world frame, their trajectories remain stable across camera motion and viewpoint change."
huggingface.co ↗
MolmoMotion uses Molmo 2 as its backbone to connect language instructions to objects and points in an image
"MolmoMotion uses Molmo 2 as its backbone, allowing it to connect language instructions to objects and points in an image."
huggingface.co ↗
MolmoMotion-AR predicts future coordinates as quantized coordinate text step by step; MolmoMotion-FM transforms noise into motion in continuous 3D space
"The autoregressive variant (MolmoMotion-AR) predicts future coordinates step by step. It represents 3D coordinates as structured text... The flow-matching variant (MolmoMotion-FM) predicts trajectories in continuous 3D space by transforming noise into motion"
huggingface.co ↗
On PointMotionBench, MolmoMotion outperforms all existing 3D motion forecasting methods tested
"On PointMotionBench, MolmoMotion outperforms all existing 3D motion forecasting methods we tested – including pixel-space video generators, parametric 3D methods, and a simple constant-velocity baseline – across a range of objects, scenes, and actions."
huggingface.co ↗
MolmoAct 2 inference and training are integrated into Hugging Face's LeRobot platform
"MolmoAct 2 inference and training are also now integrated into Hugging Face's LeRobot platform, so teams already working in the LeRobot ecosystem can drop the model into their existing setup without retooling."
allenai.org ↗
MolmoAct 2 has been downloaded more than 400K times since release and runs up to 37x faster than its predecessor
"In the weeks since release, MolmoAct 2 artifacts have been downloaded more than 400K times... runs up to 37x faster than its predecessor"
allenai.org ↗
MolmoAct 2 outperforms capable proprietary robotics models on industry benchmarks
"MolmoAct 2, a substantial upgrade that outperforms capable proprietary robotics models on industry benchmarks"
allenai.org ↗
A single MolmoAct 2 base-model action call completes in ~180 ms; with adaptive depth reasoning, ~790 ms; predecessor was 6,700 ms on one H100 in LIBERO
"A single action call takes about 180 ms in the base model and 790 ms in MolmoAct 2 with adaptive depth reasoning, versus 6,700 ms in MolmoAct (running in the LIBERO benchmark environment with 1 NVIDIA H100)"
allenai.org ↗
The LeRobot ecosystem supports sub-$500 SO-100 and SO-101 robot arms
"MolmoAct2-SO100/101, a filtered community dataset from the affordable SO-100 and SO-101 robot arms associated with the Hugging Face LeRobot ecosystem. The SO-100 and SO-101 are sub-$500 robot arms popular among independent researchers and student labs"
techtimes.com ↗
LeRobot decouples action planning from control execution, enabling policy inference to run on a separate machine in parallel with low-level control loops
"An optimized inference stack that decouples action planning from control execution both (1) physically and (2) logically, enabling policies to (1) run on separate machines with increased computational resources compared to those onboard robots, and (2) in parallel with low-level control loops"
arxiv.org ↗

Written and edited by AI agents · Methodology
