Researchers released iWorld-Bench, a benchmark with 330,000 video clips and 4,900 test tasks designed to evaluate embodied AI world models in interactive physical environments.

The benchmark was accepted at ICML 2026. From the 330,000-clip dataset, 2,100 high-quality samples were curated to span varied lighting conditions, weather states, multiple viewpoints, and scene types. The samples feed 4,900 discrete test cases across six task categories.

The six task types use an Action Generation Framework (AGF) that normalizes evaluation across world models accepting different input modalities: camera parameters, keyboard-style control codes, or raw trajectory files. Tasks are bucketed into four degrees-of-freedom difficulty levels (1–4), plus two specialized categories: Memory Ability, which requires a model to revisit a prior location along a cyclic path, and Camera Following, which tests trajectory adherence using camera parameter files. Level-1 tasks cover 9 basic single-axis moves; Level-4 tasks demand correct composition of 16 distinct four-degree-of-freedom maneuvers.
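The full AGF specification has not been published, but the normalization idea can be sketched. Below is a minimal, hypothetical Python illustration assuming a shared delta-pose action record; all names here (`UnifiedAction`, `KEY_MAP`, `from_keys`, `from_trajectory`) and the mapping values are ours, not the benchmark's.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical unified action record; the published AGF spec may differ.
@dataclass
class UnifiedAction:
    dx: float = 0.0   # translation along x (right)
    dy: float = 0.0   # translation along y (up)
    dz: float = 0.0   # translation along z (forward)
    yaw: float = 0.0  # rotation about the vertical axis, radians

# Keyboard-style control codes mapped to unit actions (illustrative values).
KEY_MAP = {
    "W": UnifiedAction(dz=1.0),    # forward
    "S": UnifiedAction(dz=-1.0),   # backward
    "A": UnifiedAction(dx=-1.0),   # strafe left
    "D": UnifiedAction(dx=1.0),    # strafe right
    "Q": UnifiedAction(yaw=0.1),   # turn left
    "E": UnifiedAction(yaw=-0.1),  # turn right
}

def from_keys(codes: str) -> List[UnifiedAction]:
    """Translate a keyboard control string into unified actions."""
    return [KEY_MAP[c] for c in codes if c in KEY_MAP]

def from_trajectory(waypoints: List[tuple]) -> List[UnifiedAction]:
    """Convert raw (x, y, z) waypoints into per-step pose deltas."""
    actions = []
    for (x0, y0, z0), (x1, y1, z1) in zip(waypoints, waypoints[1:]):
        actions.append(UnifiedAction(dx=x1 - x0, dy=y1 - y0, dz=z1 - z0))
    return actions

if __name__ == "__main__":
    print(from_keys("WWAD"))
    print(from_trajectory([(0, 0, 0), (0, 0, 1), (1, 0, 1)]))
```

Once keyboard codes, trajectory files, and camera parameters all reduce to the same action sequence, models with different native interfaces can be scored on identical tasks.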

FIG. 02 iWorld-Bench organizes 4,900 test samples across six task types unified by the Action Generation Framework.

Nine evaluation metrics span two layers. Generation-quality metrics include a normalized MUSIQ score for rendering fidelity, a brightness-consistency measure, a color-temperature check, and a Tenengrad-based sharpness score. The second layer covers spatial-topological consistency: it evaluates whether the model's camera motion in reciprocal tasks mirrors the commanded trajectory. A top-performing model scores 80.96 on MUSIQ versus 42.14 for a lower-ranked baseline; on motion consistency, the top model scores 94.98 versus near-zero for the baseline.
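The announcement does not reproduce the exact formulations, but two of the metrics have familiar shapes. The sketch below assumes the classic Tenengrad definition (mean squared Sobel gradient magnitude); `motion_consistency` is our illustrative cosine-similarity stand-in for a trajectory-adherence score, not the benchmark's published metric.

```python
import numpy as np
from scipy import ndimage

def tenengrad(gray: np.ndarray) -> float:
    """Classic Tenengrad focus measure: mean squared Sobel gradient
    magnitude over a 2-D grayscale frame. Higher = sharper."""
    g = gray.astype(np.float64)
    gx = ndimage.sobel(g, axis=1)  # horizontal gradient
    gy = ndimage.sobel(g, axis=0)  # vertical gradient
    return float(np.mean(gx**2 + gy**2))

def motion_consistency(commanded: np.ndarray, observed: np.ndarray) -> float:
    """Illustrative 0-100 score: mean cosine similarity between commanded
    and observed per-step camera displacements (an assumption on our part,
    not the paper's exact metric). Inputs are (N, 3) position arrays."""
    dc = np.diff(commanded, axis=0)
    do = np.diff(observed, axis=0)
    num = np.sum(dc * do, axis=1)
    den = np.linalg.norm(dc, axis=1) * np.linalg.norm(do, axis=1) + 1e-9
    cos = num / den
    return float(100.0 * np.clip(cos, 0.0, 1.0).mean())

if __name__ == "__main__":
    frame = np.random.rand(256, 256)          # stand-in grayscale frame
    path = np.array([[0, 0, 0], [0, 0, 1], [1, 0, 1]], dtype=float)
    print(tenengrad(frame))                   # sharpness score
    print(motion_consistency(path, path))     # identical paths -> 100.0
```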

Testing 14 representative world models revealed consistent failure modes: models achieving acceptable visual generation quality frequently collapse on memory tasks and multi-degree-of-freedom action control. For enterprise robotics and autonomous-systems teams, this distinction matters. Programs relying on generation-quality proxies alone risk deploying models that cannot maintain spatial coherence across extended interaction sequences.

FIG. 03 MUSIQ visual quality vs. memory performance in world models tested on iWorld-Bench. Top performers excel at rendering but show significant memory gaps.

Seven prior benchmarks each lack at least one dimension iWorld-Bench covers: multiple input modalities, interactive task design, camera control, memory evaluation, multi-scene coverage, multi-perspective observations, and all-weather adaptability. WorldModelBench, the largest prior dataset at 67,000 examples, lacks every interactive capability iWorld-Bench introduces. iWorld-Bench is the first to satisfy all seven simultaneously.

Code, dataset downloads, and the public leaderboard are listed as "coming soon" on the project site, limiting reproducibility. The test suite is also constrained to simulated environments. How AGF-defined action spaces transfer to physical hardware with sensor noise and actuation lag is unknown. The team has not published hardware-in-the-loop results.

For teams building embodied AI systems, iWorld-Bench sets a concrete checklist: any world model under evaluation should be run against all four action-difficulty tiers and the memory task category before deployment. Models that clear generation-quality gates but fail on cyclic-path memory are not production-ready for dynamic physical environments.
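As a gating pattern, that checklist reduces to something like the following sketch. The tier keys and the 60-point threshold are illustrative assumptions, not benchmark-defined values.

```python
# Hypothetical pre-deployment gate; field names and the threshold are
# our assumptions, not part of iWorld-Bench.
REQUIRED_TIERS = ["level_1", "level_2", "level_3", "level_4", "memory"]

def deployment_ready(scores: dict, threshold: float = 60.0) -> bool:
    """Pass only if every action-difficulty tier AND the memory task
    clear the bar; generation quality alone is not checked here."""
    return all(scores.get(t, 0.0) >= threshold for t in REQUIRED_TIERS)

if __name__ == "__main__":
    # A model that renders well but fails cyclic-path memory is rejected.
    print(deployment_ready({"level_1": 92, "level_2": 85, "level_3": 71,
                            "level_4": 64, "memory": 12}))  # False
```

The point of gating on the minimum rather than an average is exactly the failure mode the benchmark surfaces: strong generation scores can mask a collapsed memory score.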
