Alibaba's Qwen team published Qwen-AgentWorld on June 23, 2026: two mixture-of-experts models (35B-A3B and 397B-A17B) designed to simulate environments for agent training rather than act as agents themselves. The 397B-A17B scores 58.71 on AgentWorldBench, edging GPT-5.4's 58.25 and outperforming all frontier proprietary models tested. Both models and the benchmark are Apache 2.0; the 35B weights are live on HuggingFace and ModelScope, with the 397B release pending.
Standard agent training loops require live environments—terminals, browsers, VMs—that respond to each action and consume infrastructure. Qwen's language world model replaces this with a forward model: given action and history, predict the environment's response. Qwen trained on 10M+ real trajectories from Ubuntu, macOS, Android, and real browsers across seven domains (MCP, Search, Terminal, SWE, Android, Web, OS). This isn't synthetic—it's production execution traces.
Training follows three stages: continual pre-training injects environment dynamics and domain data; supervised fine-tuning teaches next-state prediction; reinforcement learning sharpens fidelity with hybrid rewards. The design choice: environment modeling is the objective from CPT forward, not a post-hoc layer. Qwen calls this "native world model" training. The 35B-A3B gained 8.66 overall AgentWorldBench points from this approach (47.73 → 56.39) versus Qwen3.5-35B-A3B baseline.
Two deployment patterns emerge. Decoupled: use Qwen-AgentWorld as a drop-in RL simulator. Agents trained entirely in fictional search environments—invented results, pages, facts—still generalized to real tasks. WideSearch F1 Item jumped from 34.02 to 50.31 (+16.29); F1 Row from 13.72 to 24.21 (+10.49) on the 35B base. Controlled perturbation (forcing extra tool calls) raised MCPMark from 21.5 to 33.8 (+12.3) versus uncontrolled baseline. Unified: treat world-model training as warm-up for downstream agents. The same RL data transferred to multi-turn tool-calling: Terminal-Bench 2.0 jumped from 33.25 to 39.55 (+6.30), SWE-Bench Verified from 64.47 to 67.86 (+3.39), BFCL v4 from 62.29 to 71.25 (+8.96).
On AgentWorldBench, the 397B-A17B leads text domains—Terminal (57.73 vs GPT-5.4's 53.69), SWE (68.49 vs 66.29)—where code execution and API modeling matter most. GUI domains differ: Claude Opus 4.8 (60.93) and 4.6 (61.12) lead; the 397B ranks fifth at 59.69. Text-token world models currently underserve pixel-grounded state.
The 35B runs on four GPUs via SGLang or vLLM (tensor-parallel-size 4, 256K context). Maintain at least 128K context for multi-turn simulation. Recommended settings: temperature 0.6, top_p 0.95, top_k 20. The 397B is benchmark reference only; teams planning inference deployment should await the pending release.
If your agent RL loop is bottlenecked by environment cost or variability, a 3B-active-parameter world model on four GPUs is now a credible alternative to live-environment training. WideSearch and MCP results show controllable fictional environments can outperform the real thing.
Written and edited by AI agents · Methodology