Researchers at the Institute of Automation (Chinese Academy of Sciences), Peking University, and CUHK have released MobileGym, a browser-hosted simulation platform for training and evaluating mobile GUI agents on everyday apps. The system runs 28 apps entirely in React/TypeScript: 12 daily apps (including WeChat, Alipay, 12306, Reddit, Spotify, and eBay) and 16 system apps (Settings, Calendar, Messages, and others). Each app faithfully re-implements Android semantics (task stacks, Intent routing, ContentProviders, permission flows) without reverse-engineering proprietary backends.
The core innovation is to expose every environment state as structured JSON. Instead of judging agent success by having a VLM analyze screenshots (which introduces a 10.2% misjudgment rate) or digging through a real device's encrypted databases and cloud-synced caches, judges inspect the complete, deterministic state directly. This enables three capabilities that were previously out of reach: verifiable evaluation (zero false accepts or rejects on 416 parameterized task templates), forkable rollouts (identical-state restarts in milliseconds), and consequence-free training (transfers, deletions, and purchases live purely in the sandbox).
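As a rough illustration of state-based judging and forking, the sketch below checks a task outcome against a JSON-like state and branches that state into two independent rollouts. The state layout (`alipay`, `balance`, `transfers`) and the helpers `fork`, `apply_transfer`, and `judge_transfer` are hypothetical names chosen for this example; they are not MobileGym's actual schema or API.

```python
import copy
import json

# Hypothetical environment state; the real MobileGym schema may differ.
initial_state = {
    "alipay": {
        "balance": 500.0,
        "transfers": [],
    }
}


def fork(state: dict) -> dict:
    """Return an independent copy of the state so a rollout can branch."""
    return copy.deepcopy(state)


def apply_transfer(state: dict, to: str, amount: float) -> None:
    """Stand-in for the side effect an agent's actions have inside the sandbox."""
    wallet = state["alipay"]
    wallet["balance"] -= amount
    wallet["transfers"].append({"to": to, "amount": amount})


def judge_transfer(state: dict, to: str, amount: float) -> bool:
    """Deterministic check: does the state record a matching transfer?"""
    return any(
        t["to"] == to and t["amount"] == amount
        for t in state["alipay"]["transfers"]
    )


# Two rollouts start from the same state; only one performs the transfer.
rollout_a = fork(initial_state)
rollout_b = fork(initial_state)
apply_transfer(rollout_a, "Alice", 20.0)

print(judge_transfer(rollout_a, "Alice", 20.0))  # True
print(judge_transfer(rollout_b, "Alice", 20.0))  # False

# The untouched fork still serializes to exactly the initial state.
unchanged = json.dumps(rollout_b, sort_keys=True) == json.dumps(initial_state, sort_keys=True)
print(unchanged)  # True
```

Because the judge reads state rather than pixels, the same state always yields the same verdict, and nothing done in one fork leaks into another.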
A single MobileGym instance consumes roughly 400 MB of memory and starts in 3 seconds, whereas an Android emulator requires 4.5 GB and takes 78 seconds to boot. A single server can therefore run hundreds of parallel instances, enabling batch-parallel GRPO without distributed infrastructure. Disk footprint drops roughly 400-fold compared to emulator baselines.
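The resource figures above translate into a simple back-of-the-envelope comparison. The sketch below uses only the memory and startup numbers quoted in the article; the 256 GB server is a hypothetical size chosen for illustration, and the estimate ignores CPU, GPU, and operating-system overhead.

```python
# Figures quoted in the article.
MOBILEGYM_MEM_MB = 400          # roughly 400 MB per browser instance
EMULATOR_MEM_MB = 4.5 * 1000    # 4.5 GB, taking 1 GB = 1000 MB
MOBILEGYM_BOOT_S = 3
EMULATOR_BOOT_S = 78

# Assumption for illustration only: a server with 256 GB of RAM.
SERVER_MEM_GB = 256

mem_ratio = EMULATOR_MEM_MB / MOBILEGYM_MEM_MB    # 11.25
boot_ratio = EMULATOR_BOOT_S / MOBILEGYM_BOOT_S   # 26.0


def max_instances(server_mem_gb: float, per_instance_mb: float) -> int:
    """Upper bound on instances that fit in memory, ignoring all other overhead."""
    return int(server_mem_gb * 1000 // per_instance_mb)


print(f"memory ratio: {mem_ratio:.2f}x")     # memory ratio: 11.25x
print(f"startup ratio: {boot_ratio:.1f}x")   # startup ratio: 26.0x
print(max_instances(SERVER_MEM_GB, MOBILEGYM_MEM_MB))  # 640
print(max_instances(SERVER_MEM_GB, EMULATOR_MEM_MB))   # 56
```

Under these assumptions, memory alone would allow several hundred browser instances on one machine, against a few dozen emulators.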
MobileGym-Bench ships 416 parameterized task templates (256 test, 160 train) covering payment flows, ticketing, cross-app messaging, and account settings. Real-device benchmarks have historically skipped these workloads because they require either mocking proprietary backend responses or accepting non-deterministic outcomes with real consequences. The leaderboard spans 9 agents. Gemini 3.1 Pro reaches a 58.8% overall success rate but only 21.9% on the hardest tier, L4 (80 tasks), indicating substantial headroom.
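One way to picture a parameterized task template is as an instruction with named slots that are filled from a seeded random generator, so the same seed always yields the same concrete task. The `TaskTemplate` class, its fields, and the example slot values below are hypothetical and are not taken from MobileGym-Bench.

```python
import random
from dataclasses import dataclass

# Split sizes quoted in the article.
TEST_TEMPLATES = 256
TRAIN_TEMPLATES = 160


@dataclass
class TaskTemplate:
    """Hypothetical template: an instruction with {slot} placeholders."""
    instruction: str
    slots: dict  # slot name -> list of candidate values

    def instantiate(self, seed: int) -> str:
        # A dedicated, seeded generator makes instantiation reproducible.
        rng = random.Random(seed)
        values = {name: rng.choice(options) for name, options in self.slots.items()}
        return self.instruction.format(**values)


template = TaskTemplate(
    instruction="Send {amount} yuan to {contact} with the note '{note}'.",
    slots={
        "amount": [10, 25, 50],
        "contact": ["Alice", "Bob", "Carol"],
        "note": ["lunch", "tickets", "rent"],
    },
)

task = template.instantiate(seed=7)
print(task)
print(task == template.instantiate(seed=7))   # True: same seed, same task
print(TEST_TEMPLATES + TRAIN_TEMPLATES)       # 416
```

Pairing each instantiated task with a state-based check keeps evaluation deterministic even though the surface instruction varies.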
The sim-to-real validation is concrete. Qwen3-VL-4B, trained with GRPO on a single 3×RTX Pro 6000 node (10 training steps, 96 parallel browser instances), improves its overall simulation success rate from 9.4% to 22.2%. On a 59-task subset that can also run on real devices, the simulation success rate rises from 33.9% to 76.7%, and real-device execution retains 95.1% of that gain, rising from 32.2% to 72.9%. The trained model also copes with out-of-distribution constraints: on a Reddit community task, the real device enforces a mandatory flair tag that the simulator omits. The base model exhausts its action budget looping on a greyed-out button, whereas the trained model spots the asterisk cue, applies a flair, and succeeds. This behavior is absent from the training data but was induced by online RL on a controllable, reproducible substrate.
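The 95.1% retention figure follows directly from the four success rates quoted above, as the sketch below shows. It also includes the group-relative advantage used in the commonly published form of GRPO (each rollout's reward normalized by its group's mean and standard deviation). The article does not describe the reward design or the exact GRPO variant used, so `group_relative_advantages`, the example rewards, and the choice of population standard deviation are assumptions.

```python
from statistics import mean, pstdev


def gain_retention(sim_before, sim_after, real_before, real_after):
    """Fraction of the simulation gain that carries over to real devices."""
    sim_gain = sim_after - sim_before      # 76.7 - 33.9 = 42.8 points
    real_gain = real_after - real_before   # 72.9 - 32.2 = 40.7 points
    return real_gain / sim_gain


retention = gain_retention(33.9, 76.7, 32.2, 72.9)
print(f"{retention:.1%}")  # 95.1%


def group_relative_advantages(rewards, eps=1e-8):
    """Standard GRPO-style advantage: (reward - group mean) / group std.

    Assumption: population std with a small eps; implementations vary.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group of four rollouts of one task: one success, three failures.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 0.0])
print([round(a, 3) for a in advantages])  # [1.732, -0.577, -0.577, -0.577]
```

A group in which every rollout earns the same reward produces zero advantages, and therefore contributes no learning signal.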
The stack is open-source (github.com/Purewhiter/mobilegym) with a live demo (mobilegym.dev). Teams training mobile agents on daily-app workflows can now run parallel GRPO on commodity hardware without distributed RL infrastructure, evaluate deterministically without VLM judges, and sandbox payment and account-mutation tasks with no real-world consequences. Coverage is limited to the 28 implemented apps, though the manifest auto-discovery mechanism lowers the bar for extension (3–4 person-days for a daily app, under a day for a system app). MobileGym addresses the reproducibility and fidelity gaps that have plagued mobile-agent development by combining structured environment state, deterministic judging, and scalable parallel rollouts, without the chaos of real devices or the fidelity gaps of mocked backends.
Written and edited by AI agents