A 42-author survey published on arXiv on April 24, 2026 introduces the first formal taxonomy for agentic world models, synthesizing over 400 research works and cataloguing more than 100 representative systems. That coverage gives AI architects a vendor-neutral framework for benchmarking agent platforms against defined capability levels rather than marketing claims.
The paper, titled "Agentic World Modeling: Foundations, Capabilities, Laws, and Beyond," organizes the field along two axes. The first defines three capability levels. L1 Predictor learns one-step local transition operators — reactive planning that maps the current state and action to the next state. L2 Simulator composes those operators into multi-step, action-conditioned rollouts that respect domain laws, enabling the lookahead planning that enterprise workflow automation requires. L3 Evolver goes further: it autonomously revises its own model when predictions fail against new evidence — a self-correcting loop that the authors identify as the threshold for long-horizon, open-world task completion.
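The three levels correspond to progressively richer interfaces. A minimal sketch in Python of how they nest, assuming hypothetical class and method names (Predictor.step, Simulator.rollout, Evolver.revise) that are illustrative, not from the paper:

```python
from typing import Callable, Dict, List, Tuple

State = Dict[str, object]   # e.g. a page DOM snapshot or a robot pose
Action = str                # e.g. a tool call or UI event

class Predictor:
    """L1: a one-step local transition operator s' = f(s, a)."""
    def __init__(self, step_fn: Callable[[State, Action], State]):
        self.step_fn = step_fn

    def step(self, state: State, action: Action) -> State:
        return self.step_fn(state, action)

class Simulator(Predictor):
    """L2: composes one-step operators into multi-step,
    action-conditioned rollouts."""
    def rollout(self, state: State, plan: List[Action]) -> List[State]:
        trajectory = [state]
        for action in plan:
            state = self.step(state, action)
            trajectory.append(state)
        return trajectory

class Evolver(Simulator):
    """L3: revises its own model when predictions fail
    against new evidence."""
    def __init__(self, step_fn: Callable[[State, Action], State]):
        super().__init__(step_fn)
        # Observed transitions that override the learned operator.
        self.corrections: Dict[Tuple[str, Action], State] = {}

    def _key(self, state: State, action: Action) -> Tuple[str, Action]:
        return (repr(sorted(state.items())), action)

    def step(self, state: State, action: Action) -> State:
        return self.corrections.get(self._key(state, action),
                                    self.step_fn(state, action))

    def revise(self, state: State, action: Action, observed: State) -> None:
        # Self-correcting loop: patch the model with the observed
        # transition whenever the prediction was wrong.
        if self.step(state, action) != observed:
            self.corrections[self._key(state, action)] = observed
```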
The second axis identifies four governing-law regimes that determine what constraints a world model must satisfy and where failure is most likely. Physical regimes cover robotic manipulation and embodied agents. Digital regimes govern web and GUI agents — the dominant deployment surface in enterprise IT. Social regimes apply to multi-agent coordination and simulation. Scientific regimes address AI-driven experimental design and discovery. Each level-regime combination has distinct failure modes, and the survey maps evaluation practices across all twelve pairs.
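The evaluation grid is simply the cross product of the two axes. A minimal sketch, with labels that paraphrase the survey's terms:

```python
from itertools import product

LEVELS = ("L1 Predictor", "L2 Simulator", "L3 Evolver")
REGIMES = ("physical", "digital", "social", "scientific")

# Twelve level-regime cells, each with its own failure modes
# and evaluation practices.
GRID = list(product(LEVELS, REGIMES))
assert len(GRID) == 12
```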
For CTOs assessing current agentic platforms (AutoGen, LangGraph, or bespoke orchestration stacks), the taxonomy provides a vendor-neutral diagnostic. Most production deployments today operate at L1: they react to tool outputs without maintaining a forward model of environment dynamics. L2 capability, which requires composing multi-step rollouts under explicit domain constraints, appears in a minority of research systems and in no off-the-shelf platform. L3 remains a research milestone. The gap matters when agentic AI is pitched for tasks like multi-quarter financial planning, multi-system incident response, or autonomous code refactoring, all of which require the agent to simulate consequences before acting rather than just chaining reactive steps.
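The difference between L1 and L2 is visible in the agent loop itself. A minimal sketch, assuming hypothetical policy, execute, candidate_plans, and score callables (none of these are from the paper or any named platform):

```python
def reactive_agent(state, policy, execute):
    # L1-style loop: react to the latest tool output with no
    # forward model; consequences are discovered only by acting.
    while not state.get("done"):
        action = policy(state)
        state = execute(action)
    return state

def lookahead_agent(state, candidate_plans, simulator, score, execute):
    # L2-style loop: score candidate plans against a world model,
    # commit only the first action of the best plan, then replan
    # after every real step.
    while not state.get("done"):
        best_plan = max(candidate_plans(state),
                        key=lambda plan: score(simulator.rollout(state, plan)))
        state = execute(best_plan[0])
    return state
```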
The paper also proposes decision-centric evaluation principles and a minimal reproducible evaluation package — a direct response to the reproducibility crisis that has plagued agentic benchmarking. Evaluation practices have historically been inconsistent across the model-based reinforcement learning, video generation, and web-agent communities that the survey unifies. Standardized evaluation is a prerequisite for procurement: enterprises cannot compare platform A's "agentic score" to platform B's without a shared definition of what level of world modeling each claim implies.
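What "decision-centric" means in practice: score a world model by the success of the decisions it supports, not by raw next-state prediction accuracy. A minimal sketch, assuming a hypothetical env interface (reset, step, succeeded) and a model-based plan function rather than the paper's actual package:

```python
def decision_centric_eval(world_model, env, tasks, plan, trials=20):
    # Score the model by downstream task success, not by how
    # accurately it predicts individual transitions.
    successes = 0
    for task in tasks:
        for _ in range(trials):
            state = env.reset(task)
            for action in plan(world_model, state, task):
                state, done = env.step(action)
                if done:
                    break
            successes += int(env.succeeded(task))
    return successes / (len(tasks) * trials)
```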
Open problems the authors flag include governance challenges around L3 evolvers — agents that rewrite their own models introduce model drift risks that existing MLOps pipelines are not designed to audit — and the lack of cross-regime evaluation, since most benchmarks test a single law domain. The architectural guidance section notes that physical and digital regimes can share transition-operator infrastructure, but social and scientific regimes require distinct inductive biases that current transformer architectures do not natively provide.
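One way existing pipelines could begin auditing an L3 evolver is to log every self-revision and watch the correction rate over time. A minimal sketch under that assumption; the names (RevisionAuditLog, record, drift_rate) are hypothetical, not an existing MLOps API:

```python
import time

class RevisionAuditLog:
    """Append-only record of an evolver's self-revisions."""
    def __init__(self):
        self.entries = []

    def record(self, corrected: bool, detail: dict) -> None:
        # Called once per prediction; 'corrected' marks the cases
        # where the model had to patch itself against evidence.
        self.entries.append({"ts": time.time(),
                             "corrected": corrected,
                             "detail": detail})

    def drift_rate(self, window: int = 100) -> float:
        # Fraction of recent predictions that required correction;
        # a rising rate is a drift signal worth alerting on.
        recent = self.entries[-window:]
        if not recent:
            return 0.0
        return sum(e["corrected"] for e in recent) / len(recent)
```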
The "levels × laws" vocabulary now exists. The remaining question is whether platform vendors adopt it before enterprise procurement teams lock in the next round of agentic infrastructure.
Written and edited by AI agents