The Allen Institute for AI (Ai2) released MolmoMotion, a language-guided 3D motion forecasting model, on June 17, 2026. Given a video frame, 3D query points on an object, and a natural-language action such as "Move and rotate the wooden bowl with fruit on the table," the model outputs the future 3D trajectories of those points. The accompanying MolmoMotion-1M dataset covers 1.16M videos with object-grounded trajectory annotations and action descriptions, and PointMotionBench, a human-validated benchmark of 2.7K video clips, measures forecasting accuracy.

The motion representation uses sparse surface points in world-frame 3D space. Ai2 chose this format because it is class-agnostic (no fixed templates for hands or rigid bodies), view-stable across cameras, and directly compatible with downstream systems. The compact trajectory format plugs into robot policies or video generation models without full rendering.
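
To make the representation concrete, here is a minimal sketch of how sparse world-frame point trajectories could be held in memory. The `PointTrajectories` class, its field names, and the use of metres are assumptions for illustration; the source does not describe MolmoMotion's actual data layout.

```python
from dataclasses import dataclass
from typing import List, Tuple

# (x, y, z) in the world frame. Metres are an assumption, not from the source.
Point3D = Tuple[float, float, float]


@dataclass
class PointTrajectories:
    """Hypothetical container: positions[t][n] is query point n at step t."""

    positions: List[List[Point3D]]

    @property
    def num_steps(self) -> int:
        return len(self.positions)

    @property
    def num_points(self) -> int:
        return len(self.positions[0]) if self.positions else 0

    def displacement(self, n: int) -> Point3D:
        """Net world-frame displacement of point n from first to last step."""
        first, last = self.positions[0][n], self.positions[-1][n]
        return tuple(b - a for a, b in zip(first, last))


# Two query points on an object that slides 5 cm along x per step, 3 steps.
traj = PointTrajectories(
    positions=[
        [(0.00 + 0.05 * t, 0.0, 0.1), (0.10 + 0.05 * t, 0.0, 0.1)]
        for t in range(3)
    ]
)
print(traj.num_steps, traj.num_points, traj.displacement(0))
```

Nothing in this structure names an object class or a camera, which is the property the paragraph above describes: a consumer only needs the list of points over time.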

MolmoMotion is built on a Molmo 2 backbone that connects language instructions to specific objects and query points. Two variants ship. MolmoMotion-AR (autoregressive) encodes the initial 3D coordinates as quantized text and predicts positions step by step, which produces smooth rollouts and the highest accuracy on well-defined paths. MolmoMotion-FM (flow matching) works in continuous 3D space by transforming noise into motion, so it represents distributional uncertainty instead of collapsing to a single path.
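
The idea of encoding coordinates as quantized text can be sketched in a few lines. This is an illustrative uniform-binning scheme, not Ai2's tokenizer: the coordinate range (`LO`, `HI`), the bin count (`BINS`), and the space-separated token format are all assumptions, since the source gives none of them.

```python
# Assumed values for illustration only; the source does not give the real
# coordinate range or bin count used by MolmoMotion-AR.
LO, HI, BINS = -1.0, 1.0, 1000


def quantize(value: float) -> int:
    """Map a coordinate to an integer bin index in [0, BINS - 1]."""
    clipped = min(max(value, LO), HI)
    idx = int((clipped - LO) / (HI - LO) * BINS)
    return min(idx, BINS - 1)  # value == HI would otherwise land in bin BINS


def dequantize(idx: int) -> float:
    """Return the centre of bin idx."""
    return LO + (idx + 0.5) * (HI - LO) / BINS


def encode_point(point) -> str:
    """Turn an (x, y, z) point into space-separated integer tokens."""
    return " ".join(str(quantize(c)) for c in point)


def decode_point(text: str):
    """Invert encode_point, up to half a bin width of rounding error."""
    return tuple(dequantize(int(tok)) for tok in text.split())


text = encode_point((0.1234, -0.5678, 0.3217))
print(text, decode_point(text))
```

With these assumed settings each bin is 2 mm wide, so a round trip loses at most about 1 mm per coordinate. That rounding is the price of discrete tokens, and it is the step the flow-matching variant avoids by staying in continuous space.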

On PointMotionBench, MolmoMotion outperforms every baseline Ai2 tested: pixel-space video generators, parametric 3D methods, and a constant-velocity baseline. Evaluation covered forecasting accuracy, downstream robot task success, and controllable video generation quality. The learned motion priors transfer across application domains without per-task tuning.
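
For readers unfamiliar with the simplest of those baselines, the sketch below shows a constant-velocity forecast and one way to score it. The function names are hypothetical, and the mean Euclidean point error is an assumed stand-in metric; the source does not say how PointMotionBench computes accuracy.

```python
import math


def constant_velocity_forecast(prev, curr, horizon):
    """Extrapolate each point linearly using the velocity (curr - prev).

    prev and curr are lists of (x, y, z) points at two consecutive steps.
    Returns a list of `horizon` future steps, each a list of points.
    """
    forecasts = []
    for step in range(1, horizon + 1):
        forecasts.append([
            tuple(c + step * (c - p) for p, c in zip(p_pt, c_pt))
            for p_pt, c_pt in zip(prev, curr)
        ])
    return forecasts


def mean_point_error(pred, truth):
    """Average Euclidean distance over all steps and points (assumed metric)."""
    dists = [
        math.dist(p, q)
        for pred_step, truth_step in zip(pred, truth)
        for p, q in zip(pred_step, truth_step)
    ]
    return sum(dists) / len(dists)


prev = [(0.0, 0.0, 0.0)]
curr = [(1.0, 0.0, 0.0)]
pred = constant_velocity_forecast(prev, curr, horizon=2)
# Ground truth in which the point lifts off the plane at the second step.
truth = [[(2.0, 0.0, 0.0)], [(3.0, 0.0, 1.0)]]
print(pred, mean_point_error(pred, truth))
```

A baseline like this ignores the language instruction entirely, so it can only be right when the object keeps doing what it was already doing.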

Teams can access the stack through Hugging Face's LeRobot platform. MolmoAct 2, Ai2's vision-language-action (VLA) policy released in May 2026, already integrates inference and training into LeRobot, so architects can add MolmoMotion as the upstream forecasting stage without redeploying. MolmoAct 2 has accumulated over 400K downloads since launch, runs 37x faster than its predecessor, and outperforms proprietary robotics models on industry benchmarks. The combined stack trains on SO-100 and SO-101 arms, which cost $500 per unit and are already part of LeRobot's ecosystem. LeRobot, published in February 2026, decouples action planning from execution, allowing policy inference to run on a separate machine in parallel with the robot's low-level control loop.
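
The decoupling of planning from execution can be illustrated with a generic producer-consumer pattern. This is not LeRobot's API: the function names, the action strings, and the timings are all hypothetical, and a background thread stands in for the separate inference machine.

```python
import queue
import threading
import time


def policy_worker(action_queue, num_chunks, chunk_size, inference_s):
    """Stand-in for remote policy inference: slow, produces action chunks."""
    for chunk_idx in range(num_chunks):
        time.sleep(inference_s)  # placeholder for a model forward pass
        chunk = [f"chunk{chunk_idx}-step{i}" for i in range(chunk_size)]
        action_queue.put(chunk)
    action_queue.put(None)  # sentinel: no more chunks


def control_loop(action_queue, tick_s):
    """Stand-in for the low-level loop: executes one action per tick."""
    executed = []
    while True:
        chunk = action_queue.get()  # blocks only if no chunk is ready yet
        if chunk is None:
            break
        for action in chunk:
            executed.append(action)  # a real loop would command the arm here
            time.sleep(tick_s)
    return executed


def run_demo():
    action_queue = queue.Queue(maxsize=2)
    worker = threading.Thread(
        target=policy_worker, args=(action_queue, 3, 4, 0.02), daemon=True
    )
    worker.start()
    executed = control_loop(action_queue, tick_s=0.001)
    worker.join()
    return executed


if __name__ == "__main__":
    print(run_demo())
```

The point of the pattern is that the worker starts computing the next chunk while the control loop is still executing the current one, so inference time overlaps with motion instead of pausing it.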

Latency matters for production: a single MolmoAct 2 action call completes in ~180 ms on one H100 in LIBERO, and the MolmoAct 2-Think variant with adaptive depth reasoning raises that to ~790 ms. Both are far below the predecessor's 6,700 ms, which caused visible pauses between movements. MolmoMotion runs upstream of execution: it generates the motion path before the action loop starts, not during it, so its inference cost does not add per-step latency.
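
The relationship between these figures is simple arithmetic, worked through below. The latencies are the approximate values quoted above; the helper names are made up for this example.

```python
# Approximate per-call latencies quoted in the article, in milliseconds.
LATENCY_MS = {
    "MolmoAct 2": 180,
    "MolmoAct 2-Think": 790,
    "predecessor": 6700,
}


def calls_per_second(latency_ms: float) -> float:
    """Upper bound on back-to-back action calls per second."""
    return 1000.0 / latency_ms


def speedup(old_ms: float, new_ms: float) -> float:
    """How many times faster the new latency is than the old one."""
    return old_ms / new_ms


for name, ms in LATENCY_MS.items():
    ratio = speedup(LATENCY_MS["predecessor"], ms)
    print(f"{name}: {calls_per_second(ms):.2f} calls/s, {ratio:.1f}x vs predecessor")
```

At ~180 ms the base model can issue roughly 5.6 action calls per second, against about 0.15 for the predecessor, and 6,700 / 180 ≈ 37 matches the quoted 37x speedup. The Think variant lands at about 8.5x.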

FIG. 02 MolmoAct 2 latency on LIBERO benchmark (H100): base model 37× faster than predecessor. — Ai2 blog, 2026

Ai2 and Hugging Face offer a complete open-weights pipeline: MolmoMotion for language-conditioned 3D trajectory prediction, MolmoAct 2 for VLA execution, and LeRobot for training and deployment. It runs on commodity arms and integrates into existing LeRobot setups without custom work.

Written and edited by AI agents