RESEARCH BY AI | EXPERT SCOUT · Sunday, April 26, 2026 · 4 MIN READ
Tested on 19 Frontier Models, MathDuels Decouples Authoring From Solving Skill
Researchers at the University of Pennsylvania propose MathDuels, a benchmark where LLMs simultaneously author adversarial math problems and solve problems authored by peers — sidestepping the contamination and ceiling effects that have made static evals unreliable for ranking GPT-4-class and above models. Because difficulty scales dynamically with the field of participants, the benchmark stays discriminative even as models improve. Enterprise AI teams selecting frontier models for reasoning-heavy workloads finally have a methodology that doesn't go stale.
FIG. 01 · Self-play evaluation decouples authoring from solving skill. (Generative imagery)
Researchers at the University of Pennsylvania have released MathDuels, a self-play evaluation framework that forces large language models to author adversarial math problems and solve problems written by competing models — producing a benchmark whose difficulty scales with the field rather than saturating at a fixed ceiling.
Benchmark saturation is already measurable. Static benchmarks like MATH and GSM8K have lost discriminative power for frontier-tier systems, and even annually refreshed competition sets are deteriorating: recent results show strong model performance on AIME 2026 problems shortly after release. No fixed problem pool can keep pace when model capabilities advance faster than new problems can be authored.
MathDuels sidesteps the problem structurally. Each of N participating models authors K problems through a three-stage pipeline — meta-prompting, problem generation, and difficulty amplification — then attempts every problem authored by every other model. Answers are verified symbolically; any problem that defeats at least one solver triggers a validity check to screen out ill-posed or ambiguous questions. The outcome matrix is then fed into a Rasch model that jointly estimates solver ability and problem difficulty, with author quality derived from the aggregate difficulty of each model's generated problems. Scores on both axes — authoring and solving — are reported separately on a public leaderboard at mathduels.ai.
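The paper's reference implementation is not reproduced here, but a minimal sketch makes the scoring step concrete. Assuming a 0/1 outcome matrix (solvers × problems) and a mask that excludes each model's own problems, a Rasch (1PL) fit jointly recovers solver ability and problem difficulty, and authoring scores then fall out as the mean estimated difficulty of each model's problems. The function name `fit_rasch`, the gradient-ascent loop, the learning rate, and the toy data are illustrative assumptions, not the authors' code.

```python
import numpy as np

def fit_rasch(outcomes, mask, lr=0.05, epochs=2000):
    """Jointly estimate solver ability (theta) and problem difficulty (b)
    under a Rasch/1PL model: P(solver i solves problem j) = sigmoid(theta_i - b_j).
    `outcomes` is an (n_solvers, n_problems) 0/1 matrix; `mask` marks the
    cells that were actually attempted (models skip their own problems)."""
    n_solvers, n_problems = outcomes.shape
    theta = np.zeros(n_solvers)   # solver ability
    b = np.zeros(n_problems)      # problem difficulty
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(theta[:, None] - b[None, :])))
        grad = (outcomes - p) * mask        # gradient of the Bernoulli log-likelihood
        theta += lr * grad.sum(axis=1)
        b     -= lr * grad.sum(axis=0)
        b -= b.mean()                       # pin the scale for identifiability
    return theta, b

# Hypothetical round: 4 models, 3 problems each (12 problems total).
rng = np.random.default_rng(0)
n_models, k = 4, 3
authors = np.repeat(np.arange(n_models), k)          # problem j was written by authors[j]
true_theta = rng.normal(size=n_models)
true_b = rng.normal(size=n_models * k)
p_true = 1 / (1 + np.exp(-(true_theta[:, None] - true_b[None, :])))
outcomes = (rng.random(p_true.shape) < p_true).astype(float)
mask = (np.arange(n_models)[:, None] != authors[None, :]).astype(float)  # no self-solving

theta_hat, b_hat = fit_rasch(outcomes, mask)
solving_score = theta_hat                                      # solver axis
authoring_score = np.array([b_hat[authors == m].mean()         # author axis: mean difficulty
                            for m in range(n_models)])         # of a model's own problems
print("solving:  ", np.round(solving_score, 2))
print("authoring:", np.round(authoring_score, 2))
```

The key design point the sketch preserves is that both axes come out of one joint fit: a problem only counts as "hard" relative to the estimated ability of the solvers that missed it, which is what lets the difficulty scale move with the participant pool.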
Experiments across 19 frontier models surface two findings with direct implications for enterprise model selection. First, authoring capability and solving capability are partially decoupled: strong solvers are not reliably strong authors, indicating these are distinct mathematical competency axes that single-role benchmarks conflate or miss. Second, as stronger models enter the arena, they produce problems that defeat previously dominant solvers — so the benchmark's discriminative range co-evolves with participant strength indefinitely.
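One way to read the decoupling claim quantitatively is to compare how the two leaderboard axes rank the same models. The snippet below is illustrative only; the score values are hypothetical stand-ins for published leaderboard numbers, not results from the paper.

```python
from scipy.stats import spearmanr

# Hypothetical per-model scores on the two leaderboard axes.
solving   = [1.8, 1.2, 0.9, 0.4, -0.3]   # Rasch ability estimates
authoring = [0.2, 1.5, -0.1, 1.1, 0.6]   # mean difficulty of authored problems

rho, pval = spearmanr(solving, authoring)
print(f"Spearman rho = {rho:.2f} (p = {pval:.3f})")
# A rank correlation well below 1 means the two axes order models differently,
# i.e. strong solvers are not necessarily strong authors.
```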
FIG. 02 · Authoring and solving skills are partially decoupled: strong solvers among the 19 tested models are not necessarily strong problem authors. — MathDuels · arxiv.org/abs/2604.21916
For AI architects running model-selection cycles for reasoning-heavy workloads — contract analysis, scientific computation, financial modeling — this matters operationally. Evaluations built on MATH or GSM8K today may return near-identical scores for models whose real-world reasoning gap is substantial. MathDuels offers a methodology that stays calibrated over time without requiring a new static dataset for each evaluation round.
The framework draws a parallel to a 16th-century Venetian mathematical duel between Niccolò Tartaglia and Antonio Maria Fior, in which each deposited thirty problems with a notary and the outcome exposed a capability gap no static test could have surfaced. That framing also points to a limitation: adversarial self-play evaluations are sensitive to the composition of the participant pool. A round that excludes a model tier shifts the difficulty distribution and can make scores across rounds incomparable. The authors address this partially through the Rasch model's joint estimation, but comparability across differently composed rounds remains an open methodological question.
A live leaderboard that ingests new models continuously is a serious operational commitment — and the real test of whether MathDuels holds its discriminative edge as the next generation of reasoning models arrives.