Claude Opus 4.8 tops Artificial Analysis leaderboard; first frontier model to complete every agentic case end-to-end
Anthropic released Claude Opus 4.8 on May 28, 2026, shipping incremental improvements across reasoning, coding, and agentic benchmarks at the same price as Opus 4.7 ($5/$25 per million input/output tokens). On the Artificial Analysis Intelligence Index leaderboard, Claude now leads with a 61.4% blended score and 1545 Elo, dethroning OpenAI's GPT-5.5. The gains are measurable: Opus 4.8 scores 1890 on GDPval-AA (knowledge work occupations), 137 points above Opus 4.7 and 121 points ahead of GPT-5.5 xhigh. Read as an Elo gap, the 121-point lead implies a roughly 67% win rate on head-to-head task completion.
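The ~67% figure can be reproduced with the standard Elo expected-score formula. The sketch below assumes GDPval-AA scores behave like conventional Elo ratings on a 400-point logistic scale, which the article implies but does not state; the function name `elo_expected_score` is illustrative.

```python
def elo_expected_score(rating_gap: float, scale: float = 400.0) -> float:
    """Expected score of the higher-rated side under the standard Elo model.

    Assumption: the benchmark uses the conventional 400-point logistic scale.
    """
    return 1.0 / (1.0 + 10.0 ** (-rating_gap / scale))


# Gaps quoted in the article for Opus 4.8 on GDPval-AA.
gap_vs_gpt55_xhigh = 121
gap_vs_opus47 = 137

print(f"vs GPT-5.5 xhigh: {elo_expected_score(gap_vs_gpt55_xhigh):.1%}")
print(f"vs Opus 4.7:      {elo_expected_score(gap_vs_opus47):.1%}")
```

Under that assumption the 121-point gap gives about 67%, matching the article. An expected score also counts ties as half a win, so it is only loosely a "win rate".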
On agentic benchmarks, Claude Opus 4.8 is the first frontier model to complete every case end-to-end on Anthropic's Super-Agent benchmark, beating prior Opus versions and GPT-5.5 at cost parity. On CursorBench (IDE-integrated code work), it exceeds prior Opus versions at every effort level while using 35% fewer output tokens and 15% fewer turns per task than Opus 4.7. On the Legal Agent Benchmark, it posts the highest recorded score and is the first model to break 10% on the all-pass standard. Tool calling is measurably more efficient: fewer steps for equivalent intelligence and cleaner execution patterns.
The release highlights a shift in evaluation focus: rather than leading with new benchmark scores, Anthropic emphasized reliability metrics and agentic judgment. Compared to 4.7, Opus 4.8 writes less verbose code comments, self-corrects better (it catches its own mistakes and pushes back on unsound plans), and gives more honest refusals. Anthropic's models post a 35.9% hallucination rate, substantially lower than those of competing labs. A new 'effort control' feature on claude.ai lets users specify compute intensity; fast mode runs at 2.5x speed for $10/$50 per million input/output tokens (3x cheaper than previous Fast tiers).
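To see what the fast-mode premium means for a budget, here is a minimal cost sketch using only the per-million-token prices quoted in this article ($5/$25 standard, $10/$50 fast). The `Tier` class, the `cost_usd` helper, and the workload size are hypothetical and chosen purely for illustration; this is not an Anthropic API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    name: str
    input_per_mtok: float   # USD per million input tokens
    output_per_mtok: float  # USD per million output tokens


# Prices as quoted in the article.
STANDARD = Tier("standard", 5.0, 25.0)
FAST = Tier("fast", 10.0, 50.0)


def cost_usd(tier: Tier, input_tokens: int, output_tokens: int) -> float:
    """Token cost in USD for one workload at the given tier."""
    return (
        input_tokens * tier.input_per_mtok
        + output_tokens * tier.output_per_mtok
    ) / 1_000_000


# Hypothetical daily workload: 2M input tokens, 400k output tokens.
for tier in (STANDARD, FAST):
    print(f"{tier.name}: ${cost_usd(tier, 2_000_000, 400_000):.2f}")
```

At these prices fast mode costs exactly twice the standard rate for the same token counts, so the trade is a 2x bill for the quoted 2.5x speed.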
For architects: Claude's lead on the overall leaderboard (AA-Omniscience at 27.4, HLE first place) reflects strong generalist reasoning, but GPT-5.5 remains the coding leader (59.1% on Scale's SWE-bench Pro vs. 56.7% for Opus 4.8). Open-weights models (DeepSeek-V4-Pro-Max 80.6%, Qwen3.7 80.4%, MiniMax M3 80.5%) now cluster tightly below closed-frontier models. For agentic inference and knowledge-work automation, Opus 4.8 offers measurable judgment improvements; for raw coding velocity, GPT-5.5 remains the choice. Because Opus 4.8 is priced the same as 4.7, there is no cost justification for keeping older versions in production.
Sources
- Primary source
- artificialanalysis.ai
“Anthropic retakes #1 on GDPval-AA and advances in terminal use and scientific reasoning; Claude Opus 4.8 reaches #2 on AA-Omniscience”
- morphllm.com
“Claude Fable 5 (95% SWE-bench Verified), Opus 4.8 (88.6% Verified), GPT-5.4 leads SWE-bench Pro at 59.1%”
- appwrite.io
“Claude Opus 4.8 takes #1 on Appwrite Arena without-skills board at 97.4%, first model to beat Opus 4.7”
- renovateqr.com
“Anthropic dropped Opus 4.8 officially dethroning GPT-5.5 on AA leaderboard with 61.4% blended score and 1545 Elo”