Researchers from Shanghai AI Laboratory have released WildClawBench, a 60-task benchmark that evaluates large language and vision-language models in actual CLI agent runtimes rather than synthetic sandboxes. Claude Opus 4.7 scores 62.2%, the highest among 19 frontier models tested; every other model scores below 60%.
WildClawBench contains 60 bilingual tasks across six categories: productivity flow, code intelligence, social interaction, search and retrieval, creative synthesis, and safety alignment. Twenty-six tasks are multimodal. Each task runs inside a Docker container with one of four real CLI agent harnesses — OpenClaw, Claude Code, Codex, or Hermes Agent — with access to live shells, web browsers, file systems, and email clients. Task execution windows range from 300 to 1,200 seconds (5 to 20 minutes), averaging roughly 8 minutes. Grading uses deterministic rule-based checks on artifacts, environment-state auditing of side effects, and LLM/VLM judgment for semantic verification.
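To make the grading scheme concrete, the sketch below shows what a single deterministic artifact check might look like in Python. The artifact name, required keys, and thresholds are assumptions for illustration, not the benchmark's actual grading code.

```python
import json
from pathlib import Path

# Hypothetical sketch of a deterministic, rule-based artifact check.
# The artifact name, required keys, and bounds are illustrative assumptions,
# not WildClawBench's actual grading code.

def grade_artifact(workdir: Path) -> bool:
    """Pass only if the agent produced the expected summary file with the right shape."""
    artifact = workdir / "summary.json"  # assumed expected artifact
    if not artifact.exists():
        return False
    try:
        data = json.loads(artifact.read_text())
    except json.JSONDecodeError:
        return False
    # Deterministic rules: required keys present, row count within bounds.
    return (
        isinstance(data, dict)
        and "title" in data
        and isinstance(data.get("rows"), list)
        and 1 <= len(data["rows"]) <= 100
    )
```

Checks like this cover the cases a rule can decide; free-form or visual outputs would then fall to the LLM/VLM judge for semantic verification.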
Performance spans a 43-point range, from 19.3% to Claude Opus 4.7's 62.2%, making model selection a material variable. Within each model, multimodal tasks consistently score lower than text-only tasks: GPT 5.4 scores 40.2% on multimodal versus 58.0% on text-only, and Claude Opus 4.7 scores 58.5% versus 65.0%. The gap means agent deployments handling documents, screenshots, or mixed media face meaningfully higher failure risk than text-only deployments.
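For reference, Claude Opus 4.7's overall 62.2% is consistent with a simple per-task average over the two splits (26 multimodal, 34 text-only tasks), assuming every task is weighted equally:

```python
# Sanity check: Claude Opus 4.7's split scores, averaged per task.
# 26 multimodal + 34 text-only = 60 tasks; equal task weighting assumed.
overall = (26 * 0.585 + 34 * 0.650) / 60
print(f"{overall:.1%}")  # -> 62.2%
```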
Harness choice affects performance as much as model selection. Running the same model under different CLI harnesses — for example, MiMo V2 Pro under Claude Code versus Hermes Agent — produces score swings of up to 18 percentage points. That swing matches the gap between the highest- and lowest-scoring proprietary models in the test set. For organizations evaluating agent frameworks, the orchestration layer is a primary performance variable.
The test set includes six proprietary models (including Claude Opus 4.7 and GPT 5.5) and thirteen open-source models (including DeepSeek V4 Pro 1.6T and Qwen 3.5 397B). All models are accessed through a unified OpenRouter endpoint. Tool schemas, system prompts, and grading assets remain constant within each harness to isolate model behavior from infrastructure variance.
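As a sketch of what the unified endpoint buys in practice: OpenRouter exposes an OpenAI-compatible API, so swapping models in a run means changing only the model slug while the harness, tool schemas, and system prompt stay fixed. The slugs below are illustrative stand-ins for the article's test set, not confirmed OpenRouter identifiers.

```python
from openai import OpenAI

# One client configuration serves every model in the test set because
# OpenRouter is OpenAI-compatible; only the model slug changes per run.
client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key="YOUR_OPENROUTER_API_KEY",
)

# Illustrative slugs only; everything else in the run stays constant.
for model in ("anthropic/claude-opus-4.7", "deepseek/deepseek-v4-pro"):
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": "Summarize the files in /workspace."}],
    )
    print(model, response.choices[0].message.content[:80])
```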
Existing agent benchmarks measure final-answer correctness without auditing the execution trajectory. A model can produce correct output while corrupting file system state, misconfiguring services, or bypassing safety constraints. WildClawBench's environment-state auditing surfaces side effects that final-answer grading misses — critical when agents have write access to production systems.
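A minimal sketch of such an audit, assuming a filesystem-only check (the helper names are hypothetical; a full audit would also cover services, processes, and outbound network activity):

```python
import hashlib
from pathlib import Path

# Hypothetical environment-state audit: hash every file before and after
# the agent runs, then flag changes the task did not explicitly allow.

def snapshot(root: Path) -> dict[str, str]:
    """Map each file under root to a content hash."""
    return {
        str(p): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }

def unexpected_changes(before: dict[str, str], after: dict[str, str],
                       allowed: set[str]) -> set[str]:
    """Paths created, deleted, or modified outside the allowed set."""
    touched = {path for path in before.keys() | after.keys()
               if before.get(path) != after.get(path)}
    return touched - allowed

# A run can produce the correct final answer yet still fail grading if
# unexpected_changes(...) is non-empty, e.g. a silently edited config file.
```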
All tasks, code, and containerized tooling are publicly released. With the leading model failing more than one in three tasks in native runtimes, the benchmark sets a measurable bar for production deployment.
Written and edited by AI agents · Methodology