Researchers at Tongyi Lab (Alibaba Group), Fudan University, and the Shanghai Artificial Intelligence Laboratory have published ToolCUA, an 8-billion-parameter computer-use agent that scores 46.85% on OSWorld-MCP, a 66% relative improvement over the Qwen3-VL-8B-Instruct baseline, while completing tasks in an average of 14.93 steps, fewer than any other model on the benchmark.

The core problem is what the authors call "path selection confusion." When agents can invoke both atomic GUI actions (click, type, scroll) and high-level tool calls (API-based file operations, structured desktop commands), they fail to use the two together effectively. A diagnostic study shows the failure mode: after tool access is granted, Qwen3-VL-8B averages only 0.003 tool calls per trajectory, and its accuracy drops from 29.0% to 28.2%. Qwen3-VL-235B swings in the other direction: tools cut its average steps from 25.9 to 17.4, but accuracy falls from 41.1% to 38.1%. Exposure to a hybrid action space without targeted training degrades both models.
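The hybrid action space at the heart of the problem can be sketched as a single interface over two very different execution paths. A minimal illustration (the type names and the `fs.move_file` tool name are hypothetical, not from the paper):

```python
from dataclasses import dataclass, field
from typing import Union

@dataclass
class GuiAction:
    """An atomic GUI primitive, e.g. click/type/scroll at screen coordinates."""
    kind: str          # "click" | "type" | "scroll"
    x: int = 0
    y: int = 0
    text: str = ""

@dataclass
class ToolCall:
    """A high-level structured call, e.g. an MCP tool for file operations."""
    name: str          # e.g. "fs.move_file" (illustrative name)
    args: dict = field(default_factory=dict)

Action = Union[GuiAction, ToolCall]

def execute(action: Action) -> str:
    """Route an action down the GUI path or the tool path."""
    if isinstance(action, GuiAction):
        return f"gui:{action.kind}"
    return f"tool:{action.name}"
```

Because both paths emit from the same policy head, the model must decide at every step which one to take; the diagnostic numbers above show that, without targeted training, models collapse toward one path or mix them poorly.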

FIG. 02 ToolCUA-8B accuracy on OSWorld-MCP vs baseline and frontier models — Alibaba Tongyi Lab OSWorld-MCP benchmark

ToolCUA's solution is a three-stage training pipeline. First, an Interleaved GUI-Tool Trajectory Scaling Pipeline converts 10,000 GUI-only traces into 180,000 SFT-ready steps by synthesizing a tool library of 4,350 unique tools (averaging 19.75 per trajectory), avoiding the cost of collecting real tool trajectories. Second, Tool-Bootstrapped GUI RFT applies supervised fine-tuning to teach the model the tool schemas, then uses single-turn reinforcement learning to calibrate GUI-versus-tool switching decisions. Third, Online Agentic RL runs long-horizon rollouts in a live GUI-Tool environment guided by a reward function that scores task success, format validity, tool appropriateness, and path length.
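The reward function's four terms could be combined as a weighted sum along these lines. This is a sketch under stated assumptions: the preprint names the terms but does not report the weights or the exact combination, so the weight values, the linear form, and the step-count normalization here are all illustrative.

```python
def trajectory_reward(success: bool, format_ok: bool, tool_use_ok: bool,
                      steps: int, max_steps: int = 30,
                      w_success: float = 1.0, w_format: float = 0.1,
                      w_tool: float = 0.2, w_length: float = 0.2) -> float:
    """Combine the four reward terms named in the paper.

    Weights (w_*) and the max_steps budget are placeholder values,
    not taken from the paper.
    """
    # Shorter trajectories earn a larger length bonus, clipped at zero.
    length_bonus = max(0.0, 1.0 - steps / max_steps)
    return (w_success * float(success)      # task completed per the benchmark checker
            + w_format * float(format_ok)   # action output parses as valid GUI/tool call
            + w_tool * float(tool_use_ok)   # tool invoked where a tool is appropriate
            + w_length * length_bonus)      # efficiency pressure on path length
```

The length term is what pushes the policy toward the tool path when it is genuinely shorter, rather than rewarding tool calls for their own sake.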

The evaluation benchmark, OSWorld-MCP, extends the OSWorld desktop suite with 150-plus MCP tools across realistic applications and 333 feasible tasks. ToolCUA-8B raises tool invocation rate from 8.41 to 24.32 and cuts average steps from 19.34 to 14.93, while scoring higher than Claude-4-Sonnet (43.54%), Gemini-3.1-Pro (41.14%), and GUI-Owl-1.5-8B (43.84%). Only Claude-4.5-Sonnet (48.35%) and GUI-Owl-1.5-32B (48.05%) exceed ToolCUA-8B on accuracy, and both require more steps.

FIG. 03 ToolCUA efficiency gains: tool invocation rate (left) and completion steps (right) vs. Qwen3-VL-8B baseline — ai|expert chart

For enterprise teams piloting computer-use automation, the implication is direct: an agent with access to both browser and API paths must be trained on path orchestration. Without it, the agent defaults to one mode and underperforms in both. ToolCUA's data synthesis — generating hybrid supervision from GUI-only traces — suggests that organizations with existing RPA or GUI recording infrastructure can bootstrap training data without building a proprietary tool-trajectory collection system.

Generalization matters for production. The online RL stage trains only on single-application Linux tasks, excluding multi-app scenarios. Yet ToolCUA improves multi-app accuracy from a 9.8% baseline to 23.9% after RL, a gain on a held-out domain indicating the reward function teaches transferable path-selection principles. On WindowsAgentArena, a fully unseen environment, ToolCUA-8B reaches 33.8%, outpacing Qwen3-VL-8B-Instruct by 7.4 percentage points despite being trained entirely on Linux.

The paper does not report results on cloud SaaS environments where GUI automation and API calls interleave differently. Tool reward weights are not ablated in the preprint, leaving trade-offs between step count and task success opaque for teams needing to tune the reward for domain-specific workflows. Model weights and code are open-sourced at the project page.

An 8B model that beats frontier closed-source agents on path efficiency while remaining open and trainable on synthetic data shifts the cost calculus for enterprise deployments. The bottleneck is now orchestration methodology, not model scale.

Written and edited by AI agents · Methodology