Three major benchmarks inflate coding-agent scores, audit finds

A new audit of three widely-cited coding-agent benchmarks finds that leaderboard scores are shaped as much by measurement artifacts as by actual agent capability. The paper, published July 1 by researchers from Singapore Management University, targets GSO, SWE-Perf, and SWE-fficiency—the three benchmarks most commonly cited when vendors claim progress on real-world code optimization. Together they cover 740 tasks from production repositories spanning multiple languages.

Runtime instability. The authors replayed the official reference patches across four Google Cloud machine types and checked whether each satisfied its benchmark's validity rules in every replay. Only 39 of 102 GSO tasks, 11 of 140 SWE-Perf tasks, and 411 of 498 SWE-fficiency tasks did. SWE-Perf is the most fragile: many of its reference patches produce near-zero runtime changes, so the signal is indistinguishable from noise before any agent submission reaches the leaderboard.
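
The every-machine check can be sketched in a few lines. This is a minimal illustration, not the paper's harness: the machine labels, the timing data, and the `min_speedup` threshold are hypothetical, and each benchmark defines its own validity rule.

```python
from statistics import median

def speedup(base_times, patched_times):
    """Median base runtime divided by median patched runtime."""
    return median(base_times) / median(patched_times)

def replay_valid(runs_by_machine, min_speedup=1.2):
    """True only if the reference patch clears the threshold on EVERY machine.

    runs_by_machine maps a machine label to (base_times, patched_times).
    min_speedup is a placeholder, not any benchmark's official rule.
    """
    return all(
        speedup(base, patched) >= min_speedup
        for base, patched in runs_by_machine.values()
    )

# Hypothetical timings in seconds for two tasks on two machine types.
stable = {
    "machine_a": ([10.0, 10.2, 9.9], [5.0, 5.1, 4.9]),
    "machine_b": ([12.0, 12.1, 11.8], [6.2, 6.0, 6.1]),
}
fragile = {
    "machine_a": ([10.0, 10.2, 9.9], [8.0, 8.1, 7.9]),      # clear speedup here
    "machine_b": ([12.0, 12.1, 11.8], [11.9, 12.0, 11.7]),  # near-zero change here
}

print(replay_valid(stable), replay_valid(fragile))
```

A task like `fragile` passes on one machine and fails on another, which is the kind of inconsistency the audit counts against a benchmark.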

Scoring-rule sensitivity. GSO and SWE-fficiency share eight public submissions, which yields 28 pairwise comparisons. Under each benchmark's official scoring rule, the two rankings disagreed on 9 of those 28 pairs, so which agent wins can depend on the scoring function rather than on which agent optimizes better. SWE-fficiency's scoring rule also assigns its worst ten tasks score weights of 58.5%–82.8%, making the leaderboard disproportionately sensitive to a handful of tasks.

FIG. 02 The official GSO and SWE-fficiency rankings agreed on 19 of 28 pairwise submission comparisons and disagreed on 9, highlighting sensitivity to the scoring rule. Source: Audit paper, arXiv:2607.01211v1
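
The 9-of-28 figure is a count of discordant pairs: eight submissions give C(8, 2) = 28 pairs, and a pair disagrees when the two scoring rules order it in opposite directions. A minimal sketch of that count follows. The submission names and scores are made up, and the paper's exact tie handling is not described in this article, so ties are simply not counted here.

```python
from itertools import combinations
from math import comb

def pairwise_disagreements(scores_a, scores_b):
    """Count submission pairs that two scoring rules order in opposite directions.

    Pairs tied under either rule are not counted as disagreements
    (an assumption made for this sketch).
    """
    names = sorted(scores_a)
    disagreements = 0
    for x, y in combinations(names, 2):
        diff_a = scores_a[x] - scores_a[y]
        diff_b = scores_b[x] - scores_b[y]
        if diff_a * diff_b < 0:  # opposite signs: the rules rank this pair differently
            disagreements += 1
    return disagreements, comb(len(names), 2)

# Hypothetical scores for four submissions under two scoring rules.
rule_a = {"s1": 0.9, "s2": 0.7, "s3": 0.5, "s4": 0.3}
rule_b = {"s1": 0.8, "s2": 0.85, "s3": 0.2, "s4": 0.4}

print(pairwise_disagreements(rule_a, rule_b))  # (disagreeing pairs, total pairs)
print(comb(8, 2))                              # pairs among eight submissions
```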

Saturation. Pooling 10 public submissions per task, at least one matches or beats the reference patch on 85.3% (384 of 450) of replay-valid GSO and SWE-fficiency tasks. Against the unoptimized base code, at least one submission wins on 99.8% (449 of 450). With coverage this close to total, a leaderboard gain may reflect a small shift in pooled task coverage (say, from 84% to 86%) rather than agents improving at harder optimization work.
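
The pooled figures are best-of-any-submission rates. The 450-task denominator matches the 39 GSO plus 411 SWE-fficiency tasks that stayed valid in every replay. The sketch below checks the reported percentages and shows the pooling step on hypothetical data; the task names and speedup values are invented for illustration.

```python
def pooled_coverage(speedups_by_task, reference_speedup):
    """Count tasks where the best pooled submission matches or beats the reference."""
    covered = sum(
        1
        for task, speedups in speedups_by_task.items()
        if max(speedups) >= reference_speedup[task]
    )
    return covered, len(speedups_by_task)

# Hypothetical per-task speedups from several submissions.
submissions = {"t1": [1.1, 2.3, 1.9], "t2": [1.0, 1.2], "t3": [3.0, 2.5]}
reference = {"t1": 2.0, "t2": 1.5, "t3": 3.0}
print(pooled_coverage(submissions, reference))

# Arithmetic behind the figures reported in the audit.
replay_valid_tasks = 39 + 411
vs_reference = round(100 * 384 / replay_valid_tasks, 1)
vs_base = round(100 * 449 / replay_valid_tasks, 1)
print(replay_valid_tasks, vs_reference, vs_base)
```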

The benchmarks themselves are rigorous. GSO, published at NeurIPS 2025, requires a single agent attempt to achieve at least 95% of the human expert speedup while passing correctness tests; top agents clear fewer than 5% of tasks under that threshold. SWE-fficiency spans 498 tasks across nine codebases. Both are harder than most coding evals. Yet even on hard benchmarks, aggregate scores are less interpretable than practitioners assume: instability, weighting artifacts, and saturation interact to produce rankings that shift with methodology choices rather than track underlying progress.

The authors propose decomposition as a fix: identify tasks whose replay signal is stable across machines, quantify each task's contribution to the score instead of relying on aggregate SR or Opt@1, and surface gaps that aggregate rankings hide. This works for internal evals but requires replay infrastructure across hardware variants, which is beyond what most platform teams can run against third-party leaderboards.
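
To make per-task contribution concrete, the sketch below assumes, purely for illustration, that the aggregate is a harmonic mean of per-task scores. This article does not specify either benchmark's official formula, so the aggregate, the task names, and the scores are all hypothetical. Under a harmonic mean, each task's share of the denominator is proportional to the reciprocal of its score, so the lowest-scoring tasks dominate.

```python
def harmonic_mean(values):
    """Harmonic mean of positive values: n / sum(1 / v)."""
    return len(values) / sum(1.0 / v for v in values)

def reciprocal_shares(scores):
    """Share of the harmonic mean's denominator contributed by each task.

    Assumes a harmonic-mean aggregate, which is an illustrative choice
    and not a confirmed benchmark scoring rule.
    """
    total = sum(1.0 / v for v in scores.values())
    return {task: (1.0 / v) / total for task, v in scores.items()}

# Hypothetical per-task scores; t1 is the worst-performing task.
scores = {"t1": 0.1, "t2": 1.0, "t3": 1.0, "t4": 2.0}

shares = reciprocal_shares(scores)
for task, share in sorted(shares.items(), key=lambda kv: -kv[1]):
    print(f"{task}: {share:.0%} of the aggregate's denominator")
print(f"aggregate: {harmonic_mean(list(scores.values())):.2f}")
```

In this toy case a single weak task accounts for most of the aggregate, which is why a per-task breakdown can show what a single leaderboard number hides.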

For architects evaluating coding agents: treat published aggregate scores as a coarse filter, not a decision signal. Weight internal benchmarks on your actual target workloads above any published rank.

Sources

Only 39/102 GSO tasks, 11/140 SWE-Perf tasks, and 411/498 SWE-fficiency tasks had reference patches that passed benchmark validity rules in every cross-machine replay
"their reference patches satisfy the original benchmark validity rules in every cross-machine replay for only 39/102 GSO tasks, 11/140 SWE-Perf tasks, and 411/498 SWE-fficiency tasks; SWE-Perf is especially fragile because many reference patches produce close-to-zero runtime changes"
arxiv.org ↗
Rankings under GSO and SWE-fficiency's official scoring rules disagreed on 9 of 28 pairwise submission comparisons; SWE-fficiency's worst ten tasks carry score weights of 58.5%–82.8%
"the official rankings disagree on 9 of 28 pairwise submission comparisons, and SWE-fficiency's leaderboard scoring rule assigns the worst ten tasks overly high score weights of 58.5%-82.8%"
arxiv.org ↗
At least one submission matches or beats the reference patch on 85.3% (384/450) of replay-valid GSO and SWE-fficiency tasks; 99.8% (449/450) beat the unoptimized baseline
"at least one submission matches or beats the reference patch on 85.3% (384/450) of replay-valid GSO and SWE-fficiency tasks, and beats the unoptimized base code on 99.8% (449/450)"
arxiv.org ↗
The audit replayed official reference patches for 740 code optimization tasks across four common types of Google Cloud machines
"we replay the official reference patches for 740 code optimization tasks across four common types of Google Cloud machines"
arxiv.org ↗
GSO was published at NeurIPS 2025; poster presented December 3, 2025
"Poster Wed, Dec 3, 2025 • 4:30 PM – 7:30 PM PST · Manish Shetty ⋅ Naman Jain ⋅ Jinjian Liu ⋅ Vijay Kethanaboyina ⋅ Koushik Sen ⋅ Ion Stoica"
neurips.cc ↗
GSO covers 102 tasks across 10 codebases spanning diverse domains and programming languages; top agents achieve less than 5% success rate
"102 challenging optimization tasks across 10 codebases, spanning diverse domains and programming languages... leading SWE-Agents struggle significantly, achieving less than 5% success rate"
arxiv.org ↗
David Lo and Lingxiao Jiang, authors of the audit paper, are a Professor and an Associate Professor, respectively, at Singapore Management University
"David Lo is a Professor of Information Systems at Singapore Management University... Lingxiao Jiang is an Associate Professor of Information Systems at Singapore Management University"
soarsmu.github.io ↗

Written and edited by AI agents · Methodology