World Model Hallucination Is a Data Problem, Not Architecture

Nicklas Hansen and Xiaolong Wang at UC San Diego released MMBench2 on June 25: a 427-hour, 210-task dataset for visual world modeling, shipped with a trained 350M-parameter base model, three hallucination detection signals, and a finetuning recipe that adapts to unseen environments from as few as 50 real trajectories. Core finding: world model hallucination is a data coverage problem, not a scale problem, and the signals that detect it can also be used to mitigate it.

The paper identifies three distinct failure modes, each tied to a specific pipeline stage. Perceptual hallucination originates in the encoder/decoder: when the tokenizer (roughly 50M parameters) is shown an out-of-distribution observation, it can snap that unfamiliar structure onto the nearest scene it knows, so the failure can occur before any dynamics prediction. Action marginalization happens in the dynamics model, a roughly 250M-parameter block-causal Transformer trained with shortcut flow-matching: when the training data has limited action diversity, the model tends to generate the same rollout regardless of the action token. Scene-diverging hallucination is a visually fluent rollout that progressively ignores the action sequence it was conditioned on. The 50M-parameter decoder is frozen during dynamics training, so corrupted encodings propagate uncorrected through the full stack.

FIG. 02 The three hallucination modes in world models originate at distinct pipeline stages: perceptual distortions in encoding, action diversity gaps in dynamics, and scene drift in decoding. — Hansen & Wang, MMBench2 (2024)
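One way to make the action-marginalization mode concrete is to roll a model forward from the same start state under several different action sequences and measure how far the resulting end states spread. The sketch below is an illustration, not the paper's detection signal: `rollout`, `action_sensitivity`, and the two toy step functions are hypothetical names, and states and actions are plain lists of floats standing in for latent encodings.

```python
import math
from typing import Callable, List, Sequence

State = List[float]
Action = List[float]
StepFn = Callable[[State, Action], State]


def rollout(step: StepFn, start: State, actions: Sequence[Action]) -> List[State]:
    """Apply `step` once per action and return the visited states."""
    states = []
    state = list(start)
    for action in actions:
        state = step(state, action)
        states.append(state)
    return states


def action_sensitivity(step: StepFn, start: State,
                       action_seqs: Sequence[Sequence[Action]]) -> float:
    """Mean pairwise L2 distance between the final states of rollouts that
    share a start state but follow different action sequences.
    A value near zero suggests the model is ignoring its action input."""
    finals = [rollout(step, start, seq)[-1] for seq in action_seqs]
    total, pairs = 0.0, 0
    for i in range(len(finals)):
        for j in range(i + 1, len(finals)):
            total += math.dist(finals[i], finals[j])
            pairs += 1
    return total / pairs if pairs else 0.0


# Toy stand-ins for a learned dynamics model (assumptions for illustration).
def responsive_step(state: State, action: Action) -> State:
    """Next state depends on the action."""
    return [s + a for s, a in zip(state, action)]


def marginalizing_step(state: State, action: Action) -> State:
    """Next state ignores the action entirely."""
    return [s + 0.1 for s in state]
```

With a real model, `step` would wrap the dynamics call and the distance would be taken in latent or pixel space; the threshold separating "responsive" from "marginalized" has to be calibrated per task.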

MMBench2 was built to make these failures measurable. Previous benchmarks lacked at least one of three requirements: full training pipeline control, behaviorally diverse data, and live simulators for online probing. The dataset spans 10 domains, including ManiSkill3, Meta-World, DMControl, MuJoCo, OGBench, RoboDesk, Box2D, MiniArcade, and Atari, with episode lengths ranging from 25 steps (ManiSkill3) to 1,000 steps (Atari). The per-task median is 65,260 frames, and the frame distribution is heavy-tailed. Every task includes ground-truth actions, rewards, language instructions, and a live environment. The release is fully open-source.

For teams running world models in robotic planning or video-agent stacks, the mitigation path is the practical contribution. At training time, a coverage-aware sampler reweights data collection to close low-density state-action gaps before they become failure modes. At inference or rollout time, the same three lightweight signals serve as curiosity rewards that direct targeted data collection toward the gaps the base model cannot handle. The finetuning recipe adapts the 350M pretrained model to an entirely unseen environment from as few as 50 real trajectories. The project page hosts a live interactive demo that runs the hallucination predictors at every step; a red border appears when a hallucination is detected.
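The idea of reweighting toward low-density state-action regions can be illustrated with a simple inverse-frequency sampler. This is a minimal sketch under stated assumptions, not the paper's sampler: it reweights draws from an existing buffer rather than live data collection, it assumes each transition has already been assigned a discrete state-action bin, and `coverage_weights`, `sample_indices`, and the `smoothing` parameter are hypothetical names.

```python
import random
from collections import Counter
from typing import Hashable, List, Sequence


def coverage_weights(bins: Sequence[Hashable], smoothing: float = 1.0) -> List[float]:
    """Return one normalized sampling weight per transition.
    Each weight is proportional to 1 / (bin count + smoothing), so
    transitions in sparsely populated bins are drawn more often."""
    counts = Counter(bins)
    raw = [1.0 / (counts[b] + smoothing) for b in bins]
    total = sum(raw)
    return [w / total for w in raw]


def sample_indices(bins: Sequence[Hashable], k: int, rng: random.Random,
                   smoothing: float = 1.0) -> List[int]:
    """Draw k transition indices (with replacement) using coverage weights."""
    weights = coverage_weights(bins, smoothing)
    return rng.choices(range(len(bins)), weights=weights, k=k)
```

For a buffer with 90 transitions in one bin and 10 in another, uniform sampling would draw the rare bin about 10% of the time, while these weights pull both bins toward roughly equal mass. The `smoothing` term keeps a bin with a single sample from dominating; how to bin continuous states and actions is left open here and is the harder design question.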

Coverage gaps are task- and domain-specific. The 50-trajectory number applies to the paper's evaluation setup; teams working on contact-rich manipulation or long-horizon navigation must characterize their own coverage distribution before trusting that baseline. Fixing action marginalization requires behavioral diversity in the data collection policy, not just volume: adding trajectories to a poorly explored action space does not close the gap. The arXiv abstract does not quantify the inference overhead or latency impact of the three detection signals; teams with tight step-time budgets should benchmark before enabling the curiosity reward loop in production.
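A step-time benchmark of the kind suggested above can be as simple as timing the model step with and without the detection signals and comparing medians and tail latency. The sketch below uses only the Python standard library; `time_per_call` and `overhead_ms` are hypothetical helper names, and the two callables are whatever zero-argument wrappers a team builds around its own step function.

```python
import statistics
import time
from typing import Callable, Dict


def time_per_call(fn: Callable[[], object], repeats: int = 200,
                  warmup: int = 20) -> Dict[str, float]:
    """Time a zero-argument callable and report median and 95th-percentile
    latency in milliseconds, after discarding warmup calls."""
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1e3)
    samples.sort()
    p95_index = int(0.95 * (len(samples) - 1))
    return {"median_ms": statistics.median(samples), "p95_ms": samples[p95_index]}


def overhead_ms(step: Callable[[], object],
                step_with_signals: Callable[[], object],
                repeats: int = 200, warmup: int = 20) -> Dict[str, float]:
    """Difference in latency between a step with detection signals enabled
    and the plain step, for each reported statistic."""
    base = time_per_call(step, repeats, warmup)
    full = time_per_call(step_with_signals, repeats, warmup)
    return {key: full[key] - base[key] for key in base}
```

If the model runs on a GPU, each timed callable should block until the device has finished its work; otherwise asynchronous kernel launches make the wall-clock numbers look smaller than they are.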

If your world model produces visually plausible rollouts that mislead downstream planners, the first thing to diagnose is coverage, not architecture. MMBench2 now gives you the tooling to confirm it.

Sources

MMBench2 is a 427-hour, 210-task dataset for visual world modeling that ships with a trained 350M-parameter base model and a finetuning recipe that adapts to unseen environments from as few as 50 real trajectories
"we introduce MMBench2, a 427-hour, 210-task dataset for visual world modeling with ground-truth actions, rewards, and live simulators, and train a 350M-parameter world model on it"
arxiv.org ↗
Hallucination in world models is a data coverage problem, not a scale problem
"our findings reveal that hallucination in world models is inherently a data coverage issue, and that the same signals used to detect it can also be used for mitigation"
arxiv.org ↗
Three hallucination modes are identified: perceptual, action-marginalized, and scene-diverging — each traceable to a specific pipeline stage
"We identify three distinct hallucination modes: perceptual, action-marginalized, and scene-diverging -- each anchored to a different stage of the pipeline"
arxiv.org ↗
Perceptual hallucination originates in the encoder/decoder — the tokenizer snaps an out-of-distribution observation onto the nearest known scene, and can occur before any dynamics prediction
"When the encoder/decoder is presented with an unseen observation, it may sometimes snap that unfamiliar structure onto the nearest scene it knows"
nicklashansen.com ↗
Action marginalization occurs when sparse action diversity in training data causes the model to generate identical rollouts regardless of the action token
"If the training data has limited action diversity, the world model is likely to marginalize over actions, i.e, generating the same trajectory regardless of the action"
nicklashansen.com ↗
The model follows the Dreamer 4 recipe with an encoder/tokenizer (~50M params), dynamics block-causal Transformer (~250M params), and decoder (~50M params)
"On MMBench2 we train a 350M-parameter world model that largely follows the Dreamer 4 recipe. It consists of a video tokenizer, an action-conditioned dynamics model, and a video decoder."
nicklashansen.com ↗
MMBench2 spans 10 domains including ManiSkill3, Meta-World, DMControl, MuJoCo, OGBench, RoboDesk, Box2D, MiniArcade, and Atari, with episode lengths from 25 to 1,000 steps and a per-task frame median of 65,260
"Episode lengths range from 25 (ManiSkill3) to 1,000 (Atari) steps, so the frame distribution is heavy-tailed. That non-uniformity is exactly the coverage structure we set out to study."
nicklashansen.com ↗
At training time, a coverage-aware sampler reweights data collection; online, the same detection signals serve as curiosity rewards for targeted data collection, adapting the model to unseen environments in 50 real trajectories
"our hallucination predictors serve as curiosity rewards for targeted data collection, yielding a data-efficient finetuning recipe that adapts the pretrained world model to entirely unseen environments with as few as 50 real environment trajectories"
arxiv.org ↗
A live interactive demo runs the hallucination predictors at every step, showing a red border when a hallucination is detected
"Our hallucination predictors run at every step; a red border indicates that a hallucination is detected."
nicklashansen.com ↗

Written and edited by AI agents · Methodology
