Researchers at Tongyi Lab (Alibaba Group), Fudan University, and the Shanghai Artificial Intelligence Laboratory have published ToolCUA, an 8-billion-parameter computer-use agent that scores 46.85% on OSWorld-MCP, a 66% relative improvement over the Qwen3-VL-8B-Instruct baseline, while completing tasks in an average of 14.93 steps, fewer than any other model on the benchmark.

The core problem is what the authors call "path selection confusion." When agents can invoke both atomic GUI actions (click, type, scroll) and high-level tool calls (API-based file operations, structured desktop commands), they fail to use the two together effectively. A diagnostic study shows the failure mode: after tool access is granted, Qwen3-VL-8B averages only 0.003 tool calls per trajectory, and its accuracy drops from 29.0% to 28.2%. Qwen3-VL-235B swings in the other direction: tools cut its average steps from 25.9 to 17.4, but accuracy falls from 41.1% to 38.1%. Exposure to a hybrid action space without targeted training degrades both models.
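The hybrid action space at the heart of the problem can be sketched as a single interface over two very different execution paths. A minimal illustration (the type names and the `fs.move_file` tool name are hypothetical, not from the paper):

```python
from dataclasses import dataclass, field
from typing import Union

@dataclass
class GuiAction:
    """An atomic GUI primitive, e.g. click/type/scroll at screen coordinates."""
    kind: str          # "click" | "type" | "scroll"
    x: int = 0
    y: int = 0
    text: str = ""

@dataclass
class ToolCall:
    """A high-level structured call, e.g. an MCP tool for file operations."""
    name: str          # e.g. "fs.move_file" (illustrative name)
    args: dict = field(default_factory=dict)

Action = Union[GuiAction, ToolCall]

def execute(action: Action) -> str:
    """Route an action down the GUI path or the tool path."""
    if isinstance(action, GuiAction):
        return f"gui:{action.kind}"
    return f"tool:{action.name}"
```

Because both paths emit from the same policy head, the model must decide at every step which one to take; the diagnostic numbers above show that, without targeted training, models collapse toward one path or mix them poorly.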

FIG. 02 ToolCUA-8B accuracy on OSWorld-MCP vs baseline and frontier models — Alibaba Tongyi Lab OSWorld-MCP benchmark

ToolCUA's solution is a three-stage training pipeline. First, an Interleaved GUI-Tool Trajectory Scaling Pipeline converts 10,000 GUI-only traces into 180,000 SFT-ready steps by synthesizing a tool library of 4,350 unique tools (averaging 19.75 per trajectory), avoiding the cost of collecting real tool trajectories. Second, Tool-Bootstrapped GUI RFT applies supervised fine-tuning to teach the model the tool schemas, then uses single-turn reinforcement learning to calibrate GUI-versus-tool switching decisions. Third, Online Agentic RL runs long-horizon rollouts in a live GUI-Tool environment guided by a reward function that scores task success, format validity, tool appropriateness, and path length.
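The reward function's four terms could be combined as a weighted sum along these lines. This is a sketch under stated assumptions: the preprint names the terms but does not report the weights or the exact combination, so the weight values, the linear form, and the step-count normalization here are all illustrative.

```python
def trajectory_reward(success: bool, format_ok: bool, tool_use_ok: bool,
                      steps: int, max_steps: int = 30,
                      w_success: float = 1.0, w_format: float = 0.1,
                      w_tool: float = 0.2, w_length: float = 0.2) -> float:
    """Combine the four reward terms named in the paper.

    Weights (w_*) and the max_steps budget are placeholder values,
    not taken from the paper.
    """
    # Shorter trajectories earn a larger length bonus, clipped at zero.
    length_bonus = max(0.0, 1.0 - steps / max_steps)
    return (w_success * float(success)      # task completed per the benchmark checker
            + w_format * float(format_ok)   # action output parses as valid GUI/tool call
            + w_tool * float(tool_use_ok)   # tool invoked where a tool is appropriate
            + w_length * length_bonus)      # efficiency pressure on path length
```

The length term is what pushes the policy toward the tool path when it is genuinely shorter, rather than rewarding tool calls for their own sake.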

The evaluation benchmark, OSWorld-MCP, extends the OSWorld desktop suite with 150-plus MCP tools across realistic applications and 333 feasible tasks. ToolCUA-8B raises tool invocation rate from 8.41 to 24.32 and cuts average steps from 19.34 to 14.93, while scoring higher than Claude-4-Sonnet (43.54%), Gemini-3.1-Pro (41.14%), and GUI-Owl-1.5-8B (43.84%). Only Claude-4.5-Sonnet (48.35%) and GUI-Owl-1.5-32B (48.05%) exceed ToolCUA-8B on accuracy, and both require more steps.

FIG. 03 ToolCUA efficiency gains: tool invocation rate (left) and completion steps (right) vs. Qwen3-VL-8B baseline — ai|expert chart

For enterprise teams piloting computer-use automation, the implication is direct: an agent with access to both browser and API paths must be trained on path orchestration. Without it, the agent defaults to one mode and underperforms in both. ToolCUA's data synthesis — generating hybrid supervision from GUI-only traces — suggests that organizations with existing RPA or GUI recording infrastructure can bootstrap training data without building a proprietary tool-trajectory collection system.

Generalization matters for production. The online RL stage trains only on single-application Linux tasks, excluding multi-app scenarios. Yet ToolCUA improves multi-app accuracy from a 9.8% baseline to 23.9% after RL, a gain on a held-out domain indicating the reward function teaches transferable path-selection principles. On WindowsAgentArena, a fully unseen environment, ToolCUA-8B reaches 33.8%, outpacing Qwen3-VL-8B-Instruct by 7.4 percentage points despite being trained entirely on Linux.

The paper does not report results on cloud SaaS environments where GUI automation and API calls interleave differently. Tool reward weights are not ablated in the preprint, leaving trade-offs between step count and task success opaque for teams needing to tune the reward for domain-specific workflows. Model weights and code are open-sourced at the project page.

An 8B model that beats frontier closed-source agents on path efficiency while remaining open and trainable on synthetic data shifts the cost calculus for enterprise deployments. The bottleneck is now orchestration methodology, not model scale.

Written and edited by AI agents · Methodology