RESEARCH BY AI | EXPERT SCOUT · Sunday, April 26, 2026 · 4 MIN READ
Tested on 19 Frontier Models, MathDuels Decouples Authoring From Solving Skill
Researchers at the University of Pennsylvania propose MathDuels, a benchmark where LLMs simultaneously author adversarial math problems and solve problems authored by peers — sidestepping the contamination and ceiling effects that have made static evals unreliable for ranking GPT-4-class and above models. Because difficulty scales dynamically with the field of participants, the benchmark stays discriminative even as models improve. Enterprise AI teams selecting frontier models for reasoning-heavy workloads finally have a methodology that doesn't go stale.
FIG. 01 · Self-play evaluation decouples authoring from solving skill. (Generative imagery)
Researchers at the University of Pennsylvania have released MathDuels, a self-play evaluation framework that forces large language models to author adversarial math problems and solve problems written by competing models — producing a benchmark whose difficulty scales with the field rather than saturating at a fixed ceiling.
Benchmark saturation is already measurable. Static benchmarks like MATH and GSM8K have lost discriminative power for frontier-tier systems, and even annually refreshed competition sets are deteriorating: recent results show strong model performance on AIME 2026 problems shortly after release. No fixed problem pool can keep pace when model capabilities advance faster than new problems can be authored.
MathDuels sidesteps the problem structurally. Each of N participating models authors K problems through a three-stage pipeline — meta-prompting, problem generation, and difficulty amplification — then attempts every problem authored by every other model. Answers are verified symbolically; any problem that defeats at least one solver triggers a validity check to screen out ill-posed or ambiguous questions. The outcome matrix is then fed into a Rasch model that jointly estimates solver ability and problem difficulty, with author quality derived from the aggregate difficulty of each model's generated problems. Scores on both axes — authoring and solving — are reported separately on a public leaderboard at mathduels.ai.
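The paper's reference implementation is not reproduced here, but a minimal sketch makes the scoring step concrete. Assuming a 0/1 outcome matrix (solvers × problems) and a mask that excludes each model's own problems, a Rasch (1PL) fit jointly recovers solver ability and problem difficulty, and authoring scores then fall out as the mean estimated difficulty of each model's problems. The function name `fit_rasch`, the gradient-ascent loop, the learning rate, and the toy data are illustrative assumptions, not the authors' code.

```python
import numpy as np

def fit_rasch(outcomes, mask, lr=0.05, epochs=2000):
    """Jointly estimate solver ability (theta) and problem difficulty (b)
    under a Rasch/1PL model: P(solver i solves problem j) = sigmoid(theta_i - b_j).
    `outcomes` is an (n_solvers, n_problems) 0/1 matrix; `mask` marks the
    cells that were actually attempted (models skip their own problems)."""
    n_solvers, n_problems = outcomes.shape
    theta = np.zeros(n_solvers)   # solver ability
    b = np.zeros(n_problems)      # problem difficulty
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(theta[:, None] - b[None, :])))
        grad = (outcomes - p) * mask        # gradient of the Bernoulli log-likelihood
        theta += lr * grad.sum(axis=1)
        b     -= lr * grad.sum(axis=0)
        b -= b.mean()                       # pin the scale for identifiability
    return theta, b

# Hypothetical round: 4 models, 3 problems each (12 problems total).
rng = np.random.default_rng(0)
n_models, k = 4, 3
authors = np.repeat(np.arange(n_models), k)          # problem j was written by authors[j]
true_theta = rng.normal(size=n_models)
true_b = rng.normal(size=n_models * k)
p_true = 1 / (1 + np.exp(-(true_theta[:, None] - true_b[None, :])))
outcomes = (rng.random(p_true.shape) < p_true).astype(float)
mask = (np.arange(n_models)[:, None] != authors[None, :]).astype(float)  # no self-solving

theta_hat, b_hat = fit_rasch(outcomes, mask)
solving_score = theta_hat                                      # solver axis
authoring_score = np.array([b_hat[authors == m].mean()         # author axis: mean difficulty
                            for m in range(n_models)])         # of a model's own problems
print("solving:  ", np.round(solving_score, 2))
print("authoring:", np.round(authoring_score, 2))
```

The key design point the sketch preserves is that both axes come out of one joint fit: a problem only counts as "hard" relative to the estimated ability of the solvers that missed it, which is what lets the difficulty scale move with the participant pool.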
Experiments across 19 frontier models surface two findings with direct implications for enterprise model selection. First, authoring capability and solving capability are partially decoupled: strong solvers are not reliably strong authors, indicating these are distinct mathematical competency axes that single-role benchmarks conflate or miss. Second, as stronger models enter the arena, they produce problems that defeat previously dominant solvers — so the benchmark's discriminative range co-evolves with participant strength indefinitely.
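One way to read the decoupling claim quantitatively is to compare how the two leaderboard axes rank the same models. The snippet below is illustrative only; the score values are hypothetical stand-ins for published leaderboard numbers, not results from the paper.

```python
from scipy.stats import spearmanr

# Hypothetical per-model scores on the two leaderboard axes.
solving   = [1.8, 1.2, 0.9, 0.4, -0.3]   # Rasch ability estimates
authoring = [0.2, 1.5, -0.1, 1.1, 0.6]   # mean difficulty of authored problems

rho, pval = spearmanr(solving, authoring)
print(f"Spearman rho = {rho:.2f} (p = {pval:.3f})")
# A rank correlation well below 1 means the two axes order models differently,
# i.e. strong solvers are not necessarily strong authors.
```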
FIG. 02 · Authoring and solving skills are partially decoupled: strong solvers among the 19 tested models are not necessarily strong problem authors. — MathDuels · arxiv.org/abs/2604.21916
For AI architects running model-selection cycles for reasoning-heavy workloads — contract analysis, scientific computation, financial modeling — this matters operationally. Evaluations built on MATH or GSM8K today may return near-identical scores for models whose real-world reasoning gap is substantial. MathDuels offers a methodology that stays calibrated over time without requiring a new static dataset for each evaluation round.
The framework draws a parallel to a 16th-century Venetian mathematical duel between Niccolò Tartaglia and Antonio Maria Fior, in which each deposited thirty problems with a notary and the outcome exposed a capability gap no static test could have surfaced. That framing also points to a limitation: adversarial self-play evaluations are sensitive to the composition of the participant pool. A round that excludes a model tier shifts the difficulty distribution and can make scores across rounds incomparable. The authors address this partially through the Rasch model's joint estimation, but comparability across differently composed rounds remains an open methodological question.
A live leaderboard that ingests new models continuously is a serious operational commitment — and the real test of whether MathDuels holds its discriminative edge as the next generation of reasoning models arrives.