Qwen's 397B Model Simulates Agent Environments Better Than GPT-5.4

Alibaba's Qwen team published Qwen-AgentWorld on June 23, 2026: two mixture-of-experts models (35B-A3B and 397B-A17B) designed to simulate environments for agent training rather than act as agents themselves. The 397B-A17B scores 58.71 on AgentWorldBench, edging GPT-5.4's 58.25 and outperforming all frontier proprietary models tested. Both models and the benchmark are Apache 2.0; the 35B weights are live on HuggingFace and ModelScope, with the 397B release pending.

Standard agent training loops require live environments—terminals, browsers, VMs—that respond to each action and consume infrastructure. Qwen's language world model replaces this with a forward model: given action and history, predict the environment's response. Qwen trained on 10M+ real trajectories from Ubuntu, macOS, Android, and real browsers across seven domains (MCP, Search, Terminal, SWE, Android, Web, OS). This isn't synthetic—it's production execution traces.

Training follows three stages: continual pre-training injects environment dynamics and domain data; supervised fine-tuning teaches next-state prediction; reinforcement learning sharpens fidelity with hybrid rewards. The design choice: environment modeling is the objective from CPT forward, not a post-hoc layer. Qwen calls this "native world model" training. The 35B-A3B gained 8.66 overall AgentWorldBench points from this approach (47.73 → 56.39) versus Qwen3.5-35B-A3B baseline.

Two deployment patterns emerge. Decoupled: use Qwen-AgentWorld as a drop-in RL simulator. Agents trained entirely in fictional search environments—invented results, pages, facts—still generalized to real tasks. WideSearch F1 Item jumped from 34.02 to 50.31 (+16.29); F1 Row from 13.72 to 24.21 (+10.49) on the 35B base. Controlled perturbation (forcing extra tool calls) raised MCPMark from 21.5 to 33.8 (+12.3) versus uncontrolled baseline. Unified: treat world-model training as warm-up for downstream agents. The same RL data transferred to multi-turn tool-calling: Terminal-Bench 2.0 jumped from 33.25 to 39.55 (+6.30), SWE-Bench Verified from 64.47 to 67.86 (+3.39), BFCL v4 from 62.29 to 71.25 (+8.96).

On AgentWorldBench, the 397B-A17B leads text domains—Terminal (57.73 vs GPT-5.4's 53.69), SWE (68.49 vs 66.29)—where code execution and API modeling matter most. GUI domains differ: Claude Opus 4.8 (60.93) and 4.6 (61.12) lead; the 397B ranks fifth at 59.69. Text-token world models currently underserve pixel-grounded state.

FIG. 02 Qwen-AgentWorld 397B-A17B scores on AgentWorldBench by domain vs. GPT-5.4. — Qwen-AgentWorld, arxiv.org/abs/2606.24597v1

The 35B runs on four GPUs via SGLang or vLLM (tensor-parallel-size 4, 256K context). Maintain at least 128K context for multi-turn simulation. Recommended settings: temperature 0.6, top_p 0.95, top_k 20. The 397B is benchmark reference only; teams planning inference deployment should await the pending release.

If your agent RL loop is bottlenecked by environment cost or variability, a 3B-active-parameter world model on four GPUs is now a credible alternative to live-environment training. WideSearch and MCP results show controllable fictional environments can outperform the real thing.

Sources

Qwen-AgentWorld-397B-A17B scores 58.71 on AgentWorldBench, edging GPT-5.4's 58.25 and topping every frontier proprietary model
"Qwen-AgentWorld-397B-A17B achieves the highest overall score (58.71), outperforming all frontier proprietary models including GPT-5.4 (58.25)."
github.com ↗
Two MoE models: 35B-A3B and 397B-A17B, trained on 10M+ real-world trajectories across 7 domains
"Leveraging more than 10M environment interaction trajectories of 7 domains in real-world environments, we develop Qwen-AgentWorld through a three-stage training pipeline."
arxiv.org ↗
Three-stage training pipeline: CPT injects environment dynamics, SFT activates next-state-prediction reasoning, RL sharpens simulation fidelity
"CPT injects general-purpose world modeling capabilities from the state transition dynamics and augmented professional corpora, SFT activates next-state-prediction reasoning, and RL sharpens simulation fidelity through a tailored framework with hybrid rubric-and-rule rewards."
arxiv.org ↗
Native world model design: environment modeling is the training objective from CPT onward, not a post-hoc fine-tune
"Unlike prior approaches that treat world modeling as a post-hoc add-on, Qwen-AgentWorld is a native world model: environment modeling is the training objective from the CPT stage onward."
github.com ↗
35B-A3B gained 8.66 overall AgentWorldBench points from LWM training (47.73 → 56.39)
"Qwen-AgentWorld-35B-A3B shows +8.66 improvement over Qwen3.5-35B-A3B without LWM training."
github.com ↗
Training data collected from real Ubuntu, macOS, and Android hosts and browsers — not synthetic rollouts
"they actually went and deployed real physical hosts and virtual machines (e.g. Ubuntu, macOS, and Android) and browsers. They ran agentic systems on these continuously and recorded the actual, real-world interactions"
news.ycombinator.com ↗
WideSearch Sim RL: F1 Item from 34.02 to 50.31 (+16.29); F1 Row from 13.72 to 24.21 (+10.49) on fictional training environments
"On Qwen3.5-35B-A3B-SFT, controllable Sim RL raises F1 by Item from 34.02 to 50.31 (+16.29) and F1 by Row from 13.72 to 24.21 (+10.49)... the training environments are entirely fictional: every search result, web page, and factual record is invented."
arxiv.org ↗
MCPMark raised from 21.5 to 33.8 (+12.3) with controlled perturbations vs uncontrolled baseline
"Sim RL (controlled): MCPMark 33.8 vs Sim RL (uncontrolled): 24.6 vs base: 21.5"
github.com ↗
LWM RL warm-up: Terminal-Bench 2.0 from 33.25 to 39.55 (+6.30); SWE-Bench Verified 64.47 to 67.86 (+3.39); BFCL v4 62.29 to 71.25 (+8.96)
"w/ LWM RL: Terminal-Bench 2.0 39.55, SWE-Bench Verified 67.86, BFCL v4 71.25 vs base 33.25, 64.47, 62.29"
github.com ↗
397B-A17B leads Terminal (57.73 vs GPT-5.4's 53.69) and SWE (68.49 vs 66.29); ranks 5th in GUI at 59.69
"The advantage is most pronounced on Terminal (57.73 vs. 53.69) and SWE (68.49 vs. 66.29)... Qwen-AgentWorld-397B-A17B ranking fifth (59.69)."
arxiv.org ↗
35B runs on 4 GPUs via SGLang/vLLM with 256K context; minimum 128K recommended for simulation
"The model has a default context length of 262,144 tokens... we advise maintaining a context length of at least 128K tokens."
huggingface.co ↗
AgentWorldBench evaluates 5 dimensions: Format, Factuality, Consistency, Realism, Quality, normalized to 0–100
"AgentWorldBench evaluates language world models by scoring each predicted environment observation on 5 dimensions: Format, Factuality, Consistency, Realism, and Quality."
huggingface.co ↗

Written and edited by AI agents · Methodology

Qwen's 397B Model Simulates Agent Environments Better Than GPT-5.4

Get the signal before the noise.

Get the signal before the noise.