A meta-analysis of 89,000 pairwise human-preference comparisons across 52 large language models finds that the global rankings published by major LLM leaderboards are statistically unreliable — a finding that has direct consequences for enterprise model-selection workflows that treat Arena-style scores as ground truth.

The paper "Why Global LLM Leaderboards Are Misleading" was published May 7, 2026 by Jai Moondra, Ayela Chughtai, Bhargavi Lanka, and Swati Gupta. The researchers analyzed approximately 89,000 Arena comparisons across 116 languages using the Bradley-Terry (BT) model — the same probabilistic framework leaderboards use to compute ELO-style rankings — and measured whether those rankings actually reflect consistent human preferences. They found they do not.

The core statistical problem is vote cancellation. Nearly two-thirds of decisive votes in the dataset cancel each other out when aggregated into a single global ranking. Across the top 50 models in the global BT ranking, pairwise win probabilities never exceed 0.53 — statistically indistinguishable from a coin flip. Enterprises using these rankings to choose between, say, the 5th- and 20th-ranked model are making decisions the data cannot support.
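
To put the coin-flip claim in concrete terms, the short sketch below (illustrative only, not the authors' code) converts a 0.53 head-to-head win probability into the equivalent Elo and Bradley-Terry score gaps. The functions are the standard logistic formulas for both parameterizations; the only number taken from the paper is the 0.53 ceiling itself.

```python
# Minimal sketch: relating Bradley-Terry / Elo score gaps to predicted
# head-to-head win probabilities.
import math

def bt_win_prob(delta_beta: float) -> float:
    """P(model i beats model j) under Bradley-Terry with natural-log score gap delta_beta."""
    return 1.0 / (1.0 + math.exp(-delta_beta))

def elo_win_prob(delta_rating: float) -> float:
    """Same quantity in the Elo parameterization (400-point logistic scale)."""
    return 1.0 / (1.0 + 10.0 ** (-delta_rating / 400.0))

def elo_gap_for_prob(p: float) -> float:
    """Elo rating gap that yields win probability p."""
    return 400.0 * math.log10(p / (1.0 - p))

# The paper's 0.53 ceiling corresponds to a gap of roughly 21 Elo points
# (about 0.12 natural-log BT units): barely above a coin flip.
print(f"Elo gap for p=0.53: {elo_gap_for_prob(0.53):.1f} points")
print(f"Win prob at a 21-point Elo gap: {elo_win_prob(21):.3f}")
print(f"Win prob at a 0.12 BT-score gap: {bt_win_prob(0.12):.3f}")
```

Roughly 21 Elo points separate a 0.53 favorite from a true coin flip; at that resolution, ordinal positions within the top 50 carry very little information.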

FIG. 02. Vote coverage by ranking strategy: global leaderboard vs. segmented (language/task) approach. (Moondra et al. 2026, Arena analysis)

The authors trace the failure to structured heterogeneity: rater preferences differ sharply by language, task type, and time. Language is the dominant variable. When comparisons are grouped by language family rather than pooled globally, Elo score spread increases by two orders of magnitude, producing rankings that are internally coherent. What looks like noise in a global view is actually a superposition of coherent but conflicting subpopulations voting for different models in different contexts.
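
The cancellation mechanism is easy to reproduce on toy data. The sketch below (hypothetical vote counts, not the Arena dataset) fits a Bradley-Terry model with standard MM updates once per language segment and once on the pooled votes; when segments disagree, the pooled scores collapse toward each other while the per-segment spreads stay wide.

```python
# Illustrative sketch on toy data: per-segment vs. pooled Bradley-Terry fits.
from collections import defaultdict
import math

def fit_bt(comparisons, n_iter=200):
    """Bradley-Terry scores via MM updates (Hunter-style).
    comparisons: list of (winner, loser) pairs; assumes every model has at least one win."""
    models = {m for pair in comparisons for m in pair}
    scores = {m: 1.0 for m in models}
    wins = defaultdict(int)
    pair_counts = defaultdict(int)
    for w, l in comparisons:
        wins[w] += 1
        pair_counts[frozenset((w, l))] += 1
    for _ in range(n_iter):
        new = {}
        for i in models:
            denom = sum(
                pair_counts[frozenset((i, j))] / (scores[i] + scores[j])
                for j in models if j != i and pair_counts[frozenset((i, j))]
            )
            new[i] = wins[i] / denom if denom > 0 else scores[i]
        total = sum(new.values())
        scores = {m: s * len(new) / total for m, s in new.items()}  # rescale (BT is scale-invariant)
    return scores

def elo_spread(scores):
    """Spread of scores on a 400-point Elo-like scale."""
    elos = [400.0 * math.log10(s) for s in scores.values()]
    return max(elos) - min(elos)

# Hypothetical votes: the two segments prefer different models, so the pooled fit cancels out.
votes_by_language = {
    "en": [("A", "B")] * 60 + [("B", "A")] * 40 + [("A", "C")] * 70 + [("C", "A")] * 30,
    "hi": [("B", "A")] * 60 + [("A", "B")] * 40 + [("C", "A")] * 70 + [("A", "C")] * 30,
}
pooled = [v for vs in votes_by_language.values() for v in vs]

print(f"pooled spread: {elo_spread(fit_bt(pooled)):.1f} Elo points")
for lang, vs in votes_by_language.items():
    print(f"{lang} spread:   {elo_spread(fit_bt(vs)):.1f} Elo points")
```

The toy numbers are not the paper's, but the shape of the result is the same: a near-zero spread in the pooled fit and a wide, coherent spread within each segment.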

For enterprise architects, this reframes how internal evaluation protocols should be built. A single benchmark score — or a single leaderboard position — tells you which model wins across a heterogeneous crowd. It does not tell you which model performs best for your specific user base, language distribution, or task mix. Procurement decisions anchored to global leaderboards choose the model that balances competing preferences globally, not the model that serves your users best.

The paper's constructive contribution is a framework called (λ, ν)-portfolios: small sets of models that together achieve prediction error at most λ while covering at least a ν fraction of users. The authors formulate model selection as a variant of the set-cover problem and provide theoretical guarantees using VC dimension. Applied to the Arena data, their algorithm recovers five distinct BT rankings that collectively cover over 96% of votes — compared to 21% coverage achieved by the single global ranking. A portfolio of six LLMs chosen by this method covers twice as many votes as simply picking the top six from the global leaderboard.
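
The set-cover flavor of the construction can be sketched with a plain greedy heuristic. The code below is not the authors' algorithm and carries none of their VC-dimension guarantees; the error matrix, segment structure, and thresholds are all hypothetical, chosen only to show how a few models can cover most users once preferences are structured.

```python
# Greedy sketch of the (lambda, nu)-portfolio idea: pick a small set of models
# such that at least a nu fraction of users has some model with error <= lam.
import numpy as np

def greedy_portfolio(errors: np.ndarray, lam: float, nu: float) -> list[int]:
    """errors[u, m] = prediction error of model m for user (or vote segment) u.
    Returns indices of selected models, or raises if nu coverage is unreachable."""
    n_users, n_models = errors.shape
    covers = errors <= lam                 # boolean coverage matrix
    covered = np.zeros(n_users, dtype=bool)
    chosen: list[int] = []
    target = int(np.ceil(nu * n_users))
    while covered.sum() < target:
        gains = (covers & ~covered[:, None]).sum(axis=0)   # new users each model would add
        best = int(gains.argmax())
        if gains[best] == 0:
            raise ValueError("no portfolio reaches the requested coverage")
        chosen.append(best)
        covered |= covers[:, best]
    return chosen

# Hypothetical structured preferences: 5 user segments, each well served by one model.
rng = np.random.default_rng(0)
n_users, n_models = 1000, 20
segment = rng.integers(0, 5, size=n_users)
errs = rng.uniform(0.3, 1.0, size=(n_users, n_models))
for s in range(5):
    errs[segment == s, s] = rng.uniform(0.0, 0.15, size=(segment == s).sum())

portfolio = greedy_portfolio(errs, lam=0.2, nu=0.95)
print(f"portfolio of {len(portfolio)} models: {sorted(portfolio)}")
```

With this structure, five models cover essentially all simulated users, whereas any single model covers only its own segment, which is the intuition behind the coverage gap between a portfolio and the global top-six.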

FIG. 03. Portfolio coverage advantage: six carefully selected models cover twice as many votes as the top-six global models. (Moondra et al. 2026, (λ, ν)-portfolio framework)

The framework extends beyond preference data. The authors construct a portfolio for a classification task on the COMPAS dataset using fairness-regularized classifiers and surface blind spots in the data — a separate signal for compliance and fairness teams evaluating models under regulatory scrutiny.

For model-evaluation teams: segment internal evals by language, task, and user cohort before aggregating scores. A model ranking 3rd globally may rank 1st for your primary user segment and 40th for another. The leaderboard does not know which one you are building for.
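
A minimal version of that segmentation step, sketched in pandas with a hypothetical schema (one row per pairwise judgment, with model, segment, and win columns; adapt the names to your own eval logs):

```python
# Sketch of a segmented eval report: contrast global rank with per-segment rank
# before anyone aggregates scores.
import pandas as pd

def rank_table(df: pd.DataFrame) -> pd.DataFrame:
    """Global and per-segment model ranks by win rate."""
    global_rank = (
        df.groupby("model")["win"].mean()
        .rank(ascending=False, method="min")
        .rename("global_rank")
    )
    seg_rank = (
        df.groupby(["segment", "model"])["win"].mean()
        .groupby(level="segment")
        .rank(ascending=False, method="min")
        .unstack("segment")
        .add_prefix("rank_")
    )
    return pd.concat([global_rank, seg_rank], axis=1).sort_values("global_rank")

# Toy rows: model A dominates the 'en' segment, model B dominates 'hi' and wins globally.
rows = (
    [{"model": "A", "segment": "en", "win": 1}] * 70
    + [{"model": "A", "segment": "en", "win": 0}] * 30
    + [{"model": "B", "segment": "en", "win": 1}] * 40
    + [{"model": "B", "segment": "en", "win": 0}] * 60
    + [{"model": "A", "segment": "hi", "win": 1}] * 30
    + [{"model": "A", "segment": "hi", "win": 0}] * 70
    + [{"model": "B", "segment": "hi", "win": 1}] * 65
    + [{"model": "B", "segment": "hi", "win": 0}] * 35
)
print(rank_table(pd.DataFrame(rows)))
```

In the toy output, the model that ranks first globally ranks second in the 'en' segment, the same inversion the global leaderboard hides at Arena scale.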

Written and edited by AI agents