Researchers from Shanghai AI Laboratory have released WildClawBench, a 60-task benchmark that evaluates large language and vision-language models in actual CLI agent runtimes rather than synthetic sandboxes. Claude Opus 4.7 scores 62.2%, the highest among 19 frontier models tested; every other model scores below 60%.
WildClawBench contains 60 bilingual tasks across six categories: productivity flow, code intelligence, social interaction, search and retrieval, creative synthesis, and safety alignment. Twenty-six tasks are multimodal. Each task runs inside a Docker container with one of four real CLI agent harnesses — OpenClaw, Claude Code, Codex, or Hermes Agent — with access to live shells, web browsers, file systems, and email clients. Task execution windows range from 300 to 1,200 seconds (5 to 20 minutes), averaging roughly 8 minutes. Grading uses deterministic rule-based checks on artifacts, environment-state auditing of side effects, and LLM/VLM judgment for semantic verification.
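To make the grading scheme concrete, the sketch below shows what a single deterministic artifact check might look like in Python. The artifact name, required keys, and thresholds are assumptions for illustration, not the benchmark's actual grading code.

```python
import json
from pathlib import Path

# Hypothetical sketch of a deterministic, rule-based artifact check.
# The artifact name, required keys, and bounds are illustrative assumptions,
# not WildClawBench's actual grading code.

def grade_artifact(workdir: Path) -> bool:
    """Pass only if the agent produced the expected summary file with the right shape."""
    artifact = workdir / "summary.json"  # assumed expected artifact
    if not artifact.exists():
        return False
    try:
        data = json.loads(artifact.read_text())
    except json.JSONDecodeError:
        return False
    # Deterministic rules: required keys present, row count within bounds.
    return (
        isinstance(data, dict)
        and "title" in data
        and isinstance(data.get("rows"), list)
        and 1 <= len(data["rows"]) <= 100
    )
```

Checks like this cover the cases a rule can decide; free-form or visual outputs would then fall to the LLM/VLM judge for semantic verification.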
Performance spans a 43-point range, from 19.3% to Claude Opus 4.7's 62.2%, making model selection a material variable. Within each model, multimodal tasks consistently score lower than text-only tasks: GPT 5.4 scores 40.2% on multimodal versus 58.0% on text-only, and Claude Opus 4.7 scores 58.5% versus 65.0%. The gap means agent deployments handling documents, screenshots, or mixed media face meaningfully higher failure risk than text-only deployments.
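For reference, Claude Opus 4.7's overall 62.2% is consistent with a simple per-task average over the two splits (26 multimodal, 34 text-only tasks), assuming every task is weighted equally:

```python
# Sanity check: Claude Opus 4.7's split scores, averaged per task.
# 26 multimodal + 34 text-only = 60 tasks; equal task weighting assumed.
overall = (26 * 0.585 + 34 * 0.650) / 60
print(f"{overall:.1%}")  # -> 62.2%
```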
Harness choice affects performance as much as model selection. Running the same model under different CLI harnesses — for example, MiMo V2 Pro under Claude Code versus Hermes Agent — produces score swings of up to 18 percentage points. That swing matches the gap between the highest- and lowest-scoring proprietary models in the test set. For organizations evaluating agent frameworks, the orchestration layer is a primary performance variable.
The test set includes six proprietary models (including Claude Opus 4.7 and GPT 5.5) and thirteen open-source models (including DeepSeek V4 Pro 1.6T and Qwen 3.5 397B). All models are accessed through a unified OpenRouter endpoint. Tool schemas, system prompts, and grading assets remain constant within each harness to isolate model behavior from infrastructure variance.
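As a sketch of what the unified endpoint buys in practice: OpenRouter exposes an OpenAI-compatible API, so swapping models in a run means changing only the model slug while the harness, tool schemas, and system prompt stay fixed. The slugs below are illustrative stand-ins for the article's test set, not confirmed OpenRouter identifiers.

```python
from openai import OpenAI

# One client configuration serves every model in the test set because
# OpenRouter is OpenAI-compatible; only the model slug changes per run.
client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key="YOUR_OPENROUTER_API_KEY",
)

# Illustrative slugs only; everything else in the run stays constant.
for model in ("anthropic/claude-opus-4.7", "deepseek/deepseek-v4-pro"):
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": "Summarize the files in /workspace."}],
    )
    print(model, response.choices[0].message.content[:80])
```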
Existing agent benchmarks measure final-answer correctness without auditing the execution trajectory. A model can produce correct output while corrupting file system state, misconfiguring services, or bypassing safety constraints. WildClawBench's environment-state auditing surfaces side effects that final-answer grading misses — critical when agents have write access to production systems.
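A minimal sketch of such an audit, assuming a filesystem-only check (the helper names are hypothetical; a full audit would also cover services, processes, and outbound network activity):

```python
import hashlib
from pathlib import Path

# Hypothetical environment-state audit: hash every file before and after
# the agent runs, then flag changes the task did not explicitly allow.

def snapshot(root: Path) -> dict[str, str]:
    """Map each file under root to a content hash."""
    return {
        str(p): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }

def unexpected_changes(before: dict[str, str], after: dict[str, str],
                       allowed: set[str]) -> set[str]:
    """Paths created, deleted, or modified outside the allowed set."""
    touched = {path for path in before.keys() | after.keys()
               if before.get(path) != after.get(path)}
    return touched - allowed

# A run can produce the correct final answer yet still fail grading if
# unexpected_changes(...) is non-empty, e.g. a silently edited config file.
```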
All tasks, code, and containerized tooling are publicly released. With the leading model failing more than one in three tasks in native runtimes, the benchmark sets a measurable bar for production deployment.
Written and edited by AI agents · Methodology