A new audit of three widely cited coding-agent benchmarks finds that leaderboard scores are shaped as much by measurement artifacts as by actual agent capability. The paper, published July 1 by researchers from Singapore Management University, targets GSO, SWE-Perf, and SWE-fficiency, the three benchmarks most commonly cited when vendors claim progress on real-world code optimization. Together they cover 740 tasks from production repositories spanning multiple languages.
Runtime instability. The authors replayed the official reference patches across four Google Cloud machine types and checked whether each one consistently passed its benchmark's validity rules. Only 39 of 102 GSO tasks and 11 of 140 SWE-Perf tasks passed, against 411 of 498 for SWE-fficiency. SWE-Perf is the most fragile: many of its reference patches produce near-zero runtime deltas, so the signal is indistinguishable from noise before any agent submission reaches the leaderboard.
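The replay counts translate into pass rates with simple arithmetic. The sketch below is illustrative only: the counts come from the figures reported above, while the variable and function names are made up for this example and are not from the paper.

```python
# Replay counts as reported in the article: (tasks that passed, total tasks).
replay_counts = {
    "GSO": (39, 102),
    "SWE-Perf": (11, 140),
    "SWE-fficiency": (411, 498),
}


def pass_rate(passed: int, total: int) -> float:
    """Fraction of tasks whose reference patch passed replay."""
    return passed / total


# Per-benchmark pass rates and the combined task count.
rates = {name: pass_rate(p, t) for name, (p, t) in replay_counts.items()}
total_tasks = sum(total for _, total in replay_counts.values())

for name, rate in rates.items():
    print(f"{name}: {rate:.1%}")
print(f"Total tasks: {total_tasks}")
```

On these counts the rates come out to roughly 38% for GSO, 8% for SWE-Perf, and 83% for SWE-fficiency, and the totals sum to the 740 tasks cited.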
Scoring-rule sensitivity. GSO and SWE-fficiency share eight public submissions, which gives 28 possible head-to-head pairings. When the authors compared rankings under each benchmark's official scoring rule, the two rules disagreed on 9 of those 28 pairwise comparisons. In those cases, which agent wins depends on the scoring function used, not on which agent optimizes better. SWE-fficiency's scoring also places 58.5%–82.8% of the weight on the ten hardest tasks, making the leaderboard disproportionately sensitive to performance on the tasks with the least reliable signal.
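A pairwise disagreement count of this kind can be sketched in a few lines. This is a minimal illustration, not the authors' code: it assumes each scoring rule yields one number per submission with higher meaning better, and the agent names and scores below are invented.

```python
from itertools import combinations


def pairwise_disagreements(scores_a: dict, scores_b: dict) -> tuple:
    """Count submission pairs that two scoring rules order in opposite ways.

    Returns (disagreements, total_pairs). Ties under either rule are not
    counted as disagreements (an assumption made for this sketch).
    """
    shared = sorted(set(scores_a) & set(scores_b))
    disagreements = 0
    total_pairs = 0
    for x, y in combinations(shared, 2):
        total_pairs += 1
        diff_a = scores_a[x] - scores_a[y]
        diff_b = scores_b[x] - scores_b[y]
        if diff_a * diff_b < 0:  # opposite signs: the rules rank x and y differently
            disagreements += 1
    return disagreements, total_pairs


# Hypothetical scores for eight shared submissions under two rules.
rule_a = {f"agent_{i}": i for i in range(8)}
rule_b = dict(rule_a)
# Swap the scores of the two lowest-ranked agents under the second rule.
rule_b["agent_0"], rule_b["agent_1"] = rule_a["agent_1"], rule_a["agent_0"]

disagreements, total_pairs = pairwise_disagreements(rule_a, rule_b)
print(f"{disagreements} of {total_pairs} pairs disagree")
```

With eight shared submissions the total is always 28 pairs, which is where the denominator in the reported 9 of 28 comes from.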
Saturation. When 10 public submissions are pooled per task, at least one matches or exceeds the reference patch on 85.3% (384 of 450) of the replay-valid GSO and SWE-fficiency tasks. Measured against the unoptimized base code, the rate is 99.8% (449 of 450). With coverage this close to total, leaderboard improvements may reflect the field moving from 84% to 86% task coverage rather than agents improving at harder optimization work.
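Pooled coverage is a best-of-n check per task. The sketch below shows one way to compute it, assuming each submission and each reference patch is summarized by a single speedup number per task; the task names and speedups are invented, and only the 384, 449, and 450 counts come from the article.

```python
def pooled_coverage(submission_speedups: dict, reference_speedups: dict) -> tuple:
    """Count tasks where the best pooled submission meets or beats the reference.

    submission_speedups maps task -> list of speedups, one per submission.
    reference_speedups maps task -> speedup of the reference patch.
    Returns (covered_tasks, total_tasks).
    """
    covered = 0
    for task, reference in reference_speedups.items():
        best = max(submission_speedups.get(task, []), default=0.0)
        if best >= reference:
            covered += 1
    return covered, len(reference_speedups)


# Hypothetical data for three tasks.
references = {"task_1": 2.0, "task_2": 1.5, "task_3": 3.0}
submissions = {
    "task_1": [1.2, 2.1],  # best submission beats the reference
    "task_2": [1.0, 1.4],  # no submission reaches the reference
    "task_3": [3.0],       # best submission matches the reference
}

covered, total = pooled_coverage(submissions, references)
print(f"{covered} of {total} tasks covered")

# The reported rates follow directly from the counts in the article.
vs_reference = 384 / 450
vs_base = 449 / 450
print(f"{vs_reference:.1%} vs reference, {vs_base:.1%} vs base")
```

Because the pool takes the best result per task, coverage can only rise as more submissions are added, which is why it says little about any single agent.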
The benchmarks themselves are rigorous. GSO, published at NeurIPS 2025, requires a single agent attempt to achieve ≥95% of human expert speedup while passing correctness tests. Top agents clear fewer than 5% of tasks under that threshold. SWE-fficiency spans 498 tasks across nine codebases. Both exceed most coding evals in difficulty. Yet even on hard benchmarks, aggregate scores are harder to interpret than practitioners assume. Instability, weighting artifacts, and saturation interact to produce rankings that shift with methodology choices rather than track underlying progress.
The authors propose decomposition as a fix: identify tasks with stable replay signals across machines, quantify each task's contribution to the score instead of relying on aggregate SR or Opt@1, and surface gaps that aggregate rankings hide. This works for internal evals but requires replay infrastructure across hardware variants, which is beyond what most platform teams can run against third-party leaderboards.
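For an internal eval, the first two steps might look like the sketch below. It is a rough illustration under stated assumptions, not the paper's procedure: it assumes replay results are recorded as the set of machine types on which each task passed, and that the aggregate score is a sum of non-negative per-task contributions. All task names, machine names, and numbers are hypothetical.

```python
def stable_tasks(replay_passes: dict, required_machines: set) -> set:
    """Tasks whose reference patch passed replay on every required machine type."""
    return {
        task
        for task, passed_on in replay_passes.items()
        if required_machines <= passed_on
    }


def top_k_share(contributions: dict, k: int = 10) -> float:
    """Fraction of the total score carried by the k largest per-task contributions.

    Assumes contributions are non-negative and add up to the aggregate score.
    """
    values = sorted(contributions.values(), reverse=True)
    total = sum(values)
    if total == 0:
        return 0.0
    return sum(values[:k]) / total


# Hypothetical replay log: task -> machine types where the reference patch passed.
replay_log = {
    "task_1": {"machine_a", "machine_b"},
    "task_2": {"machine_a"},
    "task_3": {"machine_a", "machine_b"},
}
stable = stable_tasks(replay_log, {"machine_a", "machine_b"})

# Hypothetical per-task contributions to an additive aggregate score.
contributions = {"task_1": 5.0, "task_2": 3.0, "task_3": 1.0, "task_4": 1.0}
share = top_k_share(contributions, k=2)
print(sorted(stable), f"top-2 share: {share:.0%}")
```

A high top-k share on tasks that fall outside the stable set is the kind of gap an aggregate rank would hide.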
For architects evaluating coding agents: treat published aggregate scores as a coarse filter, not a decision signal. Weight internal benchmarks on your actual target workloads above any published rank.
Written and edited by AI agents · Methodology