Nicklas Hansen and Xiaolong Wang at UC San Diego released MMBench2 on June 25: a 427-hour, 210-task dataset for visual world modeling, shipped with a trained 350M-parameter base model, three hallucination detection signals, and a finetuning recipe that adapts the model to unseen environments from as few as 50 real trajectories. The core finding: world model hallucination is a data coverage problem, not a scale problem, and the signals that detect it also fix it.

The paper identifies three distinct failure modes, each tied to a specific pipeline stage. Perceptual hallucination originates in the encoder/decoder, a 50M-parameter tokenizer that snaps out-of-distribution observations onto the nearest scene it knows, so the world model hallucinates before any dynamics prediction occurs. Action marginalization happens in the dynamics block, a 250M-parameter block-causal Transformer trained with shortcut flow-matching: when the training data has sparse action diversity, the model produces identical rollouts regardless of the action token. Scene-diverging hallucination is a visually fluent rollout that progressively ignores the action sequence it was conditioned on. The 50M-parameter tokenizer is frozen during dynamics training, so corrupted encodings propagate uncorrected through the full stack.
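
One way to probe for action marginalization in your own model is to roll out from the same start state under many different action sequences and check whether the outcomes differ at all. The sketch below is a minimal illustration of that idea, not the paper's detection signal: the `StepFn` interface, the `action_sensitivity` score, and both toy step functions are hypothetical stand-ins for a learned dynamics model.

```python
import math
import random
from typing import Callable, List, Sequence

State = List[float]
StepFn = Callable[[State, float], State]  # hypothetical model interface


def rollout(step: StepFn, start: State, actions: Sequence[float]) -> State:
    """Apply the action sequence one step at a time and return the final state."""
    state = list(start)
    for action in actions:
        state = step(state, action)
    return state


def action_sensitivity(step: StepFn, start: State, horizon: int = 10,
                       n_samples: int = 16, seed: int = 0) -> float:
    """Mean pairwise L2 distance between final states under random action sequences."""
    rng = random.Random(seed)
    finals = [
        rollout(step, start, [rng.uniform(-1.0, 1.0) for _ in range(horizon)])
        for _ in range(n_samples)
    ]
    distances = [
        math.dist(finals[i], finals[j])
        for i in range(n_samples)
        for j in range(i + 1, n_samples)
    ]
    return sum(distances) / len(distances)


# Toy stand-ins for a learned dynamics model (not the MMBench2 model).
def responsive_step(state: State, action: float) -> State:
    return [state[0] + 0.1 * action, state[1]]


def marginalized_step(state: State, action: float) -> State:
    return [state[0] + 0.05, state[1]]  # ignores the action entirely


if __name__ == "__main__":
    start = [0.0, 0.0]
    print("responsive:", action_sensitivity(responsive_step, start))
    print("marginalized:", action_sensitivity(marginalized_step, start))
```

A score at or near zero means the rollouts are identical no matter which actions were fed in, which is the symptom described above. In a real stack the state would be a latent or a decoded frame, and the distance metric would need to be chosen accordingly.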

FIG. 02: The three hallucination modes in world models originate at distinct pipeline stages: perceptual distortions in encoding, action diversity gaps in dynamics, and scene drift in decoding. Source: Hansen & Wang, MMBench2 (2024)

MMBench2 was built to make these failures measurable. Previous benchmarks lacked at least one of three requirements: full training pipeline control, behaviorally diverse data, and live simulators for online probing. The dataset spans 10 domains (ManiSkill3, Meta-World, DMControl, MuJoCo, OGBench, RoboDesk, Box2D, MiniArcade, Atari, and others), with episode lengths ranging from 25 to 1,000 steps across tasks and a per-task median of 65,260 frames. Every task includes ground-truth actions, rewards, language instructions, and a live environment, and the release is fully open-source.

For teams running world models in robotic planning or video-agent stacks, the mitigation path is the practical contribution. At training time, a coverage-aware sampler reweights data collection to close low-density state-action gaps before they become failure modes. At inference or rollout time, the three lightweight detection signals double as curiosity rewards that direct targeted data collection toward gaps the base model cannot handle. The finetuning recipe adapts the 350M-parameter pretrained model to a completely unseen environment from as few as 50 real trajectories. The project page hosts a live interactive demo that runs the hallucination predictors at every step; a red border appears when a failure is detected.
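
To make the coverage-aware idea concrete, here is a minimal sketch of inverse-density sampling over discretized state-action bins. It is an assumption-laden illustration, not the paper's sampler: the bin labels, the `smoothing` parameter, and the function names `coverage_weights` and `sample_indices` are all hypothetical, and it reweights draws from an existing dataset, which is only one possible reading of "reweights data collection."

```python
import random
from collections import Counter
from typing import Hashable, List, Sequence


def coverage_weights(bins: Sequence[Hashable], smoothing: float = 1.0) -> List[float]:
    """Per-sample weights inversely proportional to how crowded each sample's bin is."""
    counts = Counter(bins)
    raw = [1.0 / (counts[b] + smoothing) for b in bins]
    total = sum(raw)
    return [w / total for w in raw]  # normalized so the weights sum to 1


def sample_indices(bins: Sequence[Hashable], k: int, seed: int = 0) -> List[int]:
    """Draw k sample indices with replacement, favoring sparsely covered bins."""
    rng = random.Random(seed)
    return rng.choices(range(len(bins)), weights=coverage_weights(bins), k=k)


if __name__ == "__main__":
    # Nine samples fall in a well-covered bin "a", one in a rare bin "b".
    bins = ["a"] * 9 + ["b"]
    weights = coverage_weights(bins)
    print("weight of a common sample:", round(weights[0], 3))
    print("weight of the rare sample:", round(weights[-1], 3))
    picks = Counter(bins[i] for i in sample_indices(bins, k=1000))
    print("draws per bin:", dict(picks))
```

Under uniform sampling the rare bin would account for about one draw in ten; with these weights it is drawn far more often, so the model sees low-density regions more frequently. How states and actions are binned is the hard part in practice and is left open here.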

Coverage gaps are task- and domain-specific. The 50-trajectory figure applies to the paper's evaluation setup; teams working on contact-rich manipulation or long-horizon navigation must characterize their own coverage distribution before trusting that baseline. Action marginalization requires behavioral diversity in the data collection policy, not just volume: adding trajectories to a poorly explored action space does not close the gap. The arXiv abstract does not quantify the inference overhead or latency impact of the three detection signals; teams with tight step-time budgets should benchmark before enabling the curiosity reward loop in production.
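
For the benchmarking step, a small timing harness is usually enough to get a first estimate of per-step overhead. The sketch below assumes you can wrap one model step with and without the detection signals as zero-argument callables; `time_per_step`, `detector_overhead`, and the two toy workloads are hypothetical placeholders, not part of the MMBench2 release.

```python
import statistics
import time
from typing import Callable, Dict


def time_per_step(fn: Callable[[], object], n_steps: int = 200, warmup: int = 20) -> float:
    """Median wall-clock seconds per call, measured after a short warmup."""
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(n_steps):
        t0 = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - t0)
    return statistics.median(samples)


def detector_overhead(step: Callable[[], object],
                      step_with_detectors: Callable[[], object]) -> Dict[str, float]:
    """Compare per-step latency with and without the detection signals enabled."""
    base = time_per_step(step)
    full = time_per_step(step_with_detectors)
    return {"base_s": base, "with_detectors_s": full, "overhead_s": full - base}


# Toy workloads standing in for a model step and a step plus detectors.
def toy_step() -> int:
    return sum(range(2_000))


def toy_step_with_detectors() -> int:
    return sum(range(2_000)) + sum(range(1_000))


if __name__ == "__main__":
    report = detector_overhead(toy_step, toy_step_with_detectors)
    for name, seconds in report.items():
        print(f"{name}: {seconds * 1e6:.1f} microseconds")
```

The median is used rather than the mean so that occasional scheduler stalls do not dominate the result. For GPU-backed models, the wrapped callables would also need to synchronize the device before returning, otherwise the timer only measures kernel launch time.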

If your world model produces visually plausible rollouts that lead downstream planners astray, the first thing to diagnose is coverage, not architecture. MMBench2 now gives you the tooling to confirm it.

Written and edited by AI agents