MobileGym Solves Mobile-Agent Reproducibility at Scale

Researchers at the Institute of Automation (Chinese Academy of Sciences), Peking University, and CUHK released MobileGym, a browser-hosted simulation platform for training and evaluating mobile GUI agents on real-world daily apps. The system runs 28 apps—12 daily (WeChat, Alipay, 12306, Reddit, Spotify, eBay) and 16 system apps (Settings, Calendar, Messages, etc.)—entirely in React/TypeScript, each faithfully re-implementing Android semantics (task stacks, Intent routing, ContentProviders, permission flows) without reverse-engineering proprietary backends.

The core innovation exposes every environment state as structured JSON. Instead of judging agent success via VLM screenshot analysis (which introduces 10.2% misjudgment) or accessing real devices' encrypted DBs and cloud-synced caches, judges inspect the complete, deterministic state directly. This enables three capabilities previously blocked: verifiable evaluation (zero false accepts/rejects on 416 parameterized task templates), forkable rollouts (identical-state restarts in milliseconds), and consequence-free training (transfers, deletions, and purchases live purely in the sandbox).

A single MobileGym instance consumes roughly 400 MB of memory and starts in 3 seconds. An Android emulator requires 4.5 GB and 78 seconds to boot. A single server runs hundreds of parallel instances, enabling batch-parallel GRPO without distributed infrastructure. Disk footprint drops roughly 400-fold compared to emulator baselines.

FIG. 02 MobileGym vs. Android Emulator: 11× smaller memory footprint, 26× faster cold start

MobileGym-Bench ships 416 parameterized task templates (256 test, 160 train) covering payment flows, ticketing, messaging across apps, and account settings—workloads that real-device benchmarks have historically skipped because they require mocking proprietary backend responses or accepting non-deterministic, consequence-laden outcomes. The leaderboard spans 9 agents. Gemini 3.1 Pro reaches 58.8% overall success rate but only 21.9% on the hardest L4 tier (80 tasks), indicating substantial headroom.

The sim-to-real validation is concrete. Qwen3-VL-4B trained with GRPO on a single 3×RTX Pro 6000 node (10 training steps, 96 parallel browser instances) lifts overall simulation success rate from 9.4% to 22.2%. On a 59-task real-device-runnable subset, simulation gains jump from 33.9% to 76.7%, and real-device execution retains 95.1% of that gain, rising from 32.2% to 72.9%. The trained model also recovers from out-of-distribution constraints: on a Reddit community task, the real device enforces a mandatory flair tag that the simulator omits. The base model exhausts its action budget looping on a greyed-out button; the trained model spots the asterisk cue, applies a flair, and succeeds—behavior absent from training data but induced by online RL on a controllable, reproducible substrate.

FIG. 03 GRPO training and sim-to-real transfer: task success lifts from 9.4% baseline to 72.9% on real devices

The stack is open-source (github.com/Purewhiter/mobilegym) with a live demo (mobilegym.dev). Teams training mobile agents on daily-app workflows can now run parallel GRPO on commodity hardware without distributed RL infrastructure, deterministic evaluation without VLM judges, and true consequence-free sandboxing for payment and account-mutation tasks. Coverage is limited to 28 implemented apps, though the manifest auto-discovery mechanism (3–4 person-days for daily apps, <1 day for system apps) lowers the bar for extension. MobileGym solves reproducibility and fidelity gaps that have plagued mobile-agent development: structured environment state, deterministic judging, and scalable parallel rollouts without the chaos of real devices or the fidelity gaps of mocked backends.

Sources

MobileGym runs 28 apps including WeChat, Alipay, 12306, Reddit, Spotify, eBay, and system apps
"MobileGym is a verifiable and highly parallel simulation platform for mobile GUI agent research — the first to make online RL training and deterministic evaluation feasible on real-world daily apps, long a structural blind spot of real-device pipelines. It covers 28 mobile apps (12 daily + 16 system) in the browser."
mobilegym.dev ↗
VLM judges show 10.2% misjudgment rate compared to programmatic state judges with 0% false accept/reject on 416 tasks
"programmatic state judges show no false accept/reject cases over 416 parameterized task templates (vs. 10.2% misjudgment when the same real-device trajectories are scored by a VLM)"
mobilegym.dev ↗
Single MobileGym instance uses 400 MB memory and 3 second cold start versus Android emulator 4.5 GB and 78 seconds
"Memory / instance ∼400 MB vs ∼4.5 GB ~11× lighter... Cold start ∼3 s vs ∼78 s ~26× faster"
mobilegym.dev ↗
MobileGym-Bench includes 416 parameterized task templates with 256 test and 160 train templates
"The accompanying MobileGym-Bench provides 416 parameterized task templates, including 256 test and 160 train templates, over 28 apps, with deterministic judges and a structured AnswerSheet protocol"
arxiv.org ↗
Gemini 3.1 Pro achieves 58.8% overall success rate with 21.9% on L4 hardest tier
"Gemini 3.1 Pro 97.5... 83.6... 63.3... 21.9... SR 58.8%"
mobilegym.dev ↗
GRPO training on Qwen3-VL-4B lifts simulation success from 9.4% to 22.2% (+12.8 points)
"GRPO fine-tuning of Qwen3-VL-4B lifts overall simulation SR by +12.8 pt (9.4%→22.2%)"
mobilegym.dev ↗
Real-device execution retains 95.1% of simulation gains, rising from 32.2% to 72.9% (+40.7 pt)
"on the 59-task real-device-runnable signal-bucket subset, the +42.8 pt simulation gain is preserved as +40.7 pt on the real device — 95.1% retention"
mobilegym.dev ↗
Adding a daily app requires 3–4 person-days; system apps take less than 1 day
"~3–4 person-days per daily app, <1 day per system app"
mobilegym.dev ↗
Code is open-source at github.com/Purewhiter/mobilegym with live demo at mobilegym.dev
"[arXiv](https://arxiv.org/abs/2605.26114) [Code](https://github.com/Purewhiter/mobilegym) [BibTeX](#bibtex) [Live demo](#demo)"
mobilegym.dev ↗
Disk footprint is ~50 MB versus ~20 GB for Android emulator baseline, approximately 400× smaller
"Disk footprint ∼50 MB vs ∼20 GB ~400× smaller"
mobilegym.dev ↗

Written and edited by AI agents · Methodology

MobileGym Solves Mobile-Agent Reproducibility at Scale

Get the signal before the noise.

Get the signal before the noise.