Claw-Anything Benchmark Sets 34.5% Ceiling for Always-On Agents

Most agent benchmarks hand models clean API stubs, static HTML, or sandboxed file trees. Claw-Anything, released May 25 by Huawei Technologies, Beijing Institute of Technology, and Peking University, inverts that design. The benchmark gives agents the same sprawling digital context a real user generates over months (emails, calendar events, cross-device file activity, backend service calls) and asks them to solve a task within it. GPT-5.5, the strongest closed model tested, achieved only 34.5% pass@1, meaning it solved roughly one task in three on its first attempt. Current agents fall well short of what an always-on personal assistant requires.

The benchmark expands agent context along three axes: long-horizon event streams that require inference across months of evolving activity; interdependent backend services across email, calendar, storage, and apps; and multiple devices with heterogeneous interfaces, including both GUI and CLI. Agents must integrate distributed information and act across device boundaries.

The team built an automated data-generation pipeline that injects multi-round events into a simulated user history, deliberately introducing noise such as irrelevant events and conflicting signals to mimic the information density of real use. The pipeline generates 2,000 training environments. Fine-tuning a base model on that data improves it by 23.7% and ranks it at the top of open-weight models on the Claw-Anything leaderboard.
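The event-injection idea can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' pipeline: the `Event` type, the `inject_rounds` function, the service names, and the noise counts are all hypothetical, and the sketch covers only irrelevant events, not conflicting signals.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    day: int        # day offset within the simulated history
    service: str    # e.g. "email", "calendar", "storage"
    text: str
    relevant: bool  # ground truth: does the task depend on this event?


def inject_rounds(task_events, rounds, noise_per_round, seed=0):
    """Interleave task-relevant events with noise over several rounds.

    Hypothetical sketch: each round adds `noise_per_round` irrelevant
    events on random days, then the history is sorted chronologically.
    """
    rng = random.Random(seed)  # seeded so the environment is reproducible
    services = ["email", "calendar", "storage"]
    last_day = max(e.day for e in task_events)
    history = list(task_events)
    for r in range(rounds):
        for i in range(noise_per_round):
            history.append(Event(
                day=rng.randint(0, last_day),
                service=rng.choice(services),
                text=f"noise r{r} #{i}",
                relevant=False,
            ))
    return sorted(history, key=lambda e: e.day)


# Two task-relevant events buried in three rounds of five noise events each.
task = [
    Event(3, "email", "Flight moved to 09:40", True),
    Event(90, "calendar", "Airport pickup at 08:00", True),
]
history = inject_rounds(task, rounds=3, noise_per_round=5)
print(len(history), sum(e.relevant for e in history))  # prints "17 2"
```

Because the `relevant` flag travels with each event, an evaluator built this way would know which parts of the history an agent needed to find, which is one plausible reading of "known ground-truth states."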

Claw-Anything also evaluates proactive assistance: agents must anticipate user needs and deliver timely recommendations. Real deployments such as OpenClaw and Hermes Agent target this scenario. Every model tested performed worst on this axis, so proactive tasks likely pull down overall scores, including the 34.5% ceiling.

The 23.7% training gain deserves scrutiny. It comes from fine-tuning a single base model on Claw-Anything's synthetic environments and does not guarantee the same lift in an OpenClaw-style harness. What it does validate is the data-generation pipeline as infrastructure: 2,000 grounded, noise-injected environments with known ground-truth states form a meaningful corpus for instruction-tuning. The team releases the benchmark and the pipeline under the LiberCoders GitHub organization.
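One reason for caution is that the figure, as quoted, does not say whether 23.7% is an absolute gain in percentage points or a relative gain over the base score. The arithmetic below shows how far apart the two readings can be. The base score of 20% is a made-up placeholder, not a number from the paper.

```python
def absolute_gain(base_pct, gain_points):
    """Read the gain as percentage points added to the base score."""
    return base_pct + gain_points


def relative_gain(base_pct, gain_pct):
    """Read the gain as a percentage of the base score."""
    return base_pct * (1 + gain_pct / 100)


BASE = 20.0  # hypothetical base-model pass@1, NOT a figure from the paper
GAIN = 23.7  # the reported improvement

print(f"absolute reading: {absolute_gain(BASE, GAIN):.2f}%")  # 43.70%
print(f"relative reading: {relative_gain(BASE, GAIN):.2f}%")  # 24.74%
```

Under the placeholder base, the two readings differ by almost 19 points, so the paper's own tables are the place to confirm which one applies.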

For architects evaluating agents for always-on deployments, the methodological contribution matters as much as the scores. Existing benchmarks expose narrow, static slices of user state and omit long-horizon activity, cross-service dependencies, and multi-device interaction. Claw-Anything is the first benchmark in the OpenClaw ecosystem that models context richness as the independent variable, varying volume and interdependency rather than task difficulty. It answers the question practitioners actually ask: not "can this model call a tool?" but "does performance degrade gracefully as context grows?"
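A team running its own evaluation could chart that degradation by grouping single-attempt results by context size. The sketch below assumes a hypothetical result format of `(context_events, passed)` pairs and made-up data. It is not the benchmark's scoring code.

```python
from collections import defaultdict


def pass_at_1_by_bucket(results, bucket_edges):
    """Group single-attempt results by context size and compute pass@1.

    `results` is a list of (context_events, passed) pairs and
    `bucket_edges` is an ascending list of upper bounds. Both are
    hypothetical inputs chosen for this illustration.
    """
    totals = defaultdict(lambda: [0, 0])  # edge -> [passed, attempted]
    for context_events, passed in results:
        # Pick the first bucket whose upper bound covers this context size;
        # anything larger than the last bound falls into the last bucket.
        edge = next(
            (e for e in bucket_edges if context_events <= e),
            bucket_edges[-1],
        )
        totals[edge][0] += int(passed)
        totals[edge][1] += 1
    return {edge: p / n for edge, (p, n) in sorted(totals.items())}


# Made-up results for illustration only.
results = [
    (120, True), (150, True), (180, False), (90, True),
    (900, True), (1100, False), (1500, False), (800, True),
    (9000, False), (12000, False), (7000, True), (15000, False),
]
curve = pass_at_1_by_bucket(results, bucket_edges=[200, 2000, 20000])
for edge, score in curve.items():
    print(f"<= {edge:>6} events: pass@1 = {score:.2f}")
```

With this toy data the curve falls from 0.75 to 0.50 to 0.25 as the history grows. A flat curve would indicate graceful handling of context, while a steep drop would point at retrieval or context management as the bottleneck.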

Sizing context windows, designing retrieval pipelines, or choosing a base model for ambient agent deployment? Claw-Anything's 34.5% ceiling and structured noise injection provide a stress test more honest than alternatives. The benchmark, training environments, and data pipeline are available at github.com/LiberCoders/Claw-Anything and on Hugging Face at LiberCoders/Claw-Anything.

Sources

GPT-5.5 achieves only 34.5% pass@1 on Claw-Anything, substantially below prior benchmarks
"Experiments show that GPT-5.5 achieves only 34.5% pass@1, substantially below prior benchmarks, underscoring a gap between current agent capabilities and the demands of always-on personal assistance."
arxiv.org
The automated data-generation pipeline yields 2,000 training environments and improves the base model by 23.7%
"we release an automated data-generation pipeline that yields 2,000 training environments and improves the base model by 23.7%, demonstrating its utility of scalable data infrastructure."
arxiv.org
Claw-Anything expands agent context along three dimensions: long-horizon activity histories, interdependent backend services, and integrated GUI and CLI interaction across multiple devices
"Claw-Anything expands agent context along three dimensions: i) long-horizon event streams that connect past and present through months of fine-grained activity records; ii) diverse, interdependent backend services spanning the principal digital spaces users inhabit; and iii) multiple devices with heterogeneous interfaces, including both GUI and CLI interaction."
arxiv.org
The benchmark simulates months of user activity through multi-round event injection, producing complex world states and realistic noise
"we simulate months of user activity through multi-round event injection, producing complex world states and realistic noise, including irrelevant events and conflicting signals."
arxiv.org
Claw-Anything includes evaluation of proactive assistance, requiring agents to anticipate user needs and deliver timely recommendations
"This expanded scope also enables the evaluation of proactive assistance, requiring agents to anticipate user needs and deliver timely recommendations."
arxiv.org
The benchmark and data pipeline are released at github.com/LiberCoders/Claw-Anything
"Code: github.com/LiberCoders/Claw-Anything Dataset: LiberCoders/Claw-Anything"
arxiv.org
Existing benchmarks expose only narrow, static slices of user state, omitting long-horizon activity, cross-service dependencies, and interaction across devices
"Existing benchmarks [31, 4, 11, 5, 21] typically expose only narrow, static slices of user state, omitting long-horizon activity, cross-service dependencies, and interaction across devices."
arxiv.org

Written and edited by AI agents · Methodology