Researchers released iWorld-Bench, a benchmark with 330,000 video clips and 4,900 test tasks designed to evaluate embodied AI world models in interactive physical environments.

The benchmark was accepted at ICML 2026. From the 330,000-clip dataset, 2,100 high-quality samples were curated to span varied lighting conditions, weather states, multiple viewpoints, and scene types. The samples feed 4,900 discrete test cases across six task categories.

The six task types use an Action Generation Framework (AGF) that normalizes evaluation across world models accepting different input modalities: camera parameters, keyboard-style control codes, or raw trajectory files. Tasks are bucketed into four degrees-of-freedom difficulty levels (1–4), plus two specialized categories: Memory Ability, which requires a model to revisit a prior location along a cyclic path, and Camera Following, which tests trajectory adherence using camera parameter files. Level-1 tasks cover 9 basic single-axis moves; Level-4 tasks demand correct composition of 16 distinct four-degree-of-freedom maneuvers.
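The full AGF specification has not been published, but the normalization idea can be sketched. Below is a minimal, hypothetical Python illustration assuming a shared delta-pose action record; all names here (`UnifiedAction`, `KEY_MAP`, `from_keys`, `from_trajectory`) and the mapping values are ours, not the benchmark's.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical unified action record; the published AGF spec may differ.
@dataclass
class UnifiedAction:
    dx: float = 0.0   # translation along x (right)
    dy: float = 0.0   # translation along y (up)
    dz: float = 0.0   # translation along z (forward)
    yaw: float = 0.0  # rotation about the vertical axis, radians

# Keyboard-style control codes mapped to unit actions (illustrative values).
KEY_MAP = {
    "W": UnifiedAction(dz=1.0),    # forward
    "S": UnifiedAction(dz=-1.0),   # backward
    "A": UnifiedAction(dx=-1.0),   # strafe left
    "D": UnifiedAction(dx=1.0),    # strafe right
    "Q": UnifiedAction(yaw=0.1),   # turn left
    "E": UnifiedAction(yaw=-0.1),  # turn right
}

def from_keys(codes: str) -> List[UnifiedAction]:
    """Translate a keyboard control string into unified actions."""
    return [KEY_MAP[c] for c in codes if c in KEY_MAP]

def from_trajectory(waypoints: List[tuple]) -> List[UnifiedAction]:
    """Convert raw (x, y, z) waypoints into per-step pose deltas."""
    actions = []
    for (x0, y0, z0), (x1, y1, z1) in zip(waypoints, waypoints[1:]):
        actions.append(UnifiedAction(dx=x1 - x0, dy=y1 - y0, dz=z1 - z0))
    return actions

if __name__ == "__main__":
    print(from_keys("WWAD"))
    print(from_trajectory([(0, 0, 0), (0, 0, 1), (1, 0, 1)]))
```

Once keyboard codes, trajectory files, and camera parameters all reduce to the same action sequence, models with different native interfaces can be scored on identical tasks.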

FIG. 02 iWorld-Bench organizes 4,900 test samples across six task types unified by the Action Generation Framework.

Nine evaluation metrics span two layers. Generation-quality metrics include a normalized MUSIQ score for rendering fidelity, a brightness-consistency measure, a color-temperature check, and a Tenengrad-based sharpness score. The second layer covers spatial-topological consistency: it evaluates whether the model's camera motion in reciprocal tasks mirrors the commanded trajectory. A top-performing model scores 80.96 on MUSIQ versus 42.14 for a lower-ranked baseline; on motion consistency, the top model scores 94.98 versus near-zero for the baseline.
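The announcement does not reproduce the exact formulations, but two of the metrics have familiar shapes. The sketch below assumes the classic Tenengrad definition (mean squared Sobel gradient magnitude); `motion_consistency` is our illustrative cosine-similarity stand-in for a trajectory-adherence score, not the benchmark's published metric.

```python
import numpy as np
from scipy import ndimage

def tenengrad(gray: np.ndarray) -> float:
    """Classic Tenengrad focus measure: mean squared Sobel gradient
    magnitude over a 2-D grayscale frame. Higher = sharper."""
    g = gray.astype(np.float64)
    gx = ndimage.sobel(g, axis=1)  # horizontal gradient
    gy = ndimage.sobel(g, axis=0)  # vertical gradient
    return float(np.mean(gx**2 + gy**2))

def motion_consistency(commanded: np.ndarray, observed: np.ndarray) -> float:
    """Illustrative 0-100 score: mean cosine similarity between commanded
    and observed per-step camera displacements (an assumption on our part,
    not the paper's exact metric). Inputs are (N, 3) position arrays."""
    dc = np.diff(commanded, axis=0)
    do = np.diff(observed, axis=0)
    num = np.sum(dc * do, axis=1)
    den = np.linalg.norm(dc, axis=1) * np.linalg.norm(do, axis=1) + 1e-9
    cos = num / den
    return float(100.0 * np.clip(cos, 0.0, 1.0).mean())

if __name__ == "__main__":
    frame = np.random.rand(256, 256)          # stand-in grayscale frame
    path = np.array([[0, 0, 0], [0, 0, 1], [1, 0, 1]], dtype=float)
    print(tenengrad(frame))                   # sharpness score
    print(motion_consistency(path, path))     # identical paths -> 100.0
```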

Testing 14 representative world models revealed consistent failure modes: models achieving acceptable visual generation quality frequently collapse on memory tasks and multi-degree-of-freedom action control. For enterprise robotics and autonomous-systems teams, this distinction matters. Programs relying on generation-quality proxies alone risk deploying models that cannot maintain spatial coherence across extended interaction sequences.

FIG. 03 MUSIQ visual quality vs. memory performance in world models tested on iWorld-Bench. Top performers excel at rendering but show significant memory gaps.

Seven prior benchmarks each lack at least one dimension iWorld-Bench covers: multiple input modalities, interactive task design, camera control, memory evaluation, multi-scene coverage, multi-perspective observations, and all-weather adaptability. WorldModelBench, the largest prior dataset at 67,000 examples, lacks every interactive capability iWorld-Bench introduces. iWorld-Bench is the first to satisfy all seven simultaneously.

Code, dataset downloads, and the public leaderboard are listed as "coming soon" on the project site, limiting reproducibility. The test suite is also constrained to simulated environments. How AGF-defined action spaces transfer to physical hardware with sensor noise and actuation lag is unknown. The team has not published hardware-in-the-loop results.

For teams building embodied AI systems, iWorld-Bench sets a concrete checklist: any world model under evaluation should be run against all four action-difficulty tiers and the memory task category before deployment. Models that clear generation-quality gates but fail on cyclic-path memory are not production-ready for dynamic physical environments.
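As a gating pattern, that checklist reduces to something like the following sketch. The tier keys and the 60-point threshold are illustrative assumptions, not benchmark-defined values.

```python
# Hypothetical pre-deployment gate; field names and the threshold are
# our assumptions, not part of iWorld-Bench.
REQUIRED_TIERS = ["level_1", "level_2", "level_3", "level_4", "memory"]

def deployment_ready(scores: dict, threshold: float = 60.0) -> bool:
    """Pass only if every action-difficulty tier AND the memory task
    clear the bar; generation quality alone is not checked here."""
    return all(scores.get(t, 0.0) >= threshold for t in REQUIRED_TIERS)

if __name__ == "__main__":
    # A model that renders well but fails cyclic-path memory is rejected.
    print(deployment_ready({"level_1": 92, "level_2": 85, "level_3": 71,
                            "level_4": 64, "memory": 12}))  # False
```

The point of gating on the minimum rather than an average is exactly the failure mode the benchmark surfaces: strong generation scores can mask a collapsed memory score.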
