Google DeepMind's AI co-mathematician scores 48% on FrontierMath Tier 4, beating every prior AI system on a benchmark built from research-level problems that take expert mathematicians hours or days to solve. The system, published May 7 on arXiv by an 18-person DeepMind team, runs on Gemini 3.1 via a hierarchical multi-agent workspace: a project coordinator delegates tasks to workstream coordinators managing literature review, library development, and counterexample search. Below sit specialized agents: a search agent, a coding agent, and Gemini Deep Think as proof verifier. The entire stack operates asynchronously, maintaining persistent state across problem attempts capped at 24 hours for internal evaluations and 48 hours for FrontierMath runs. Each attempt uses roughly as many model and tool calls as a long AI-assisted software-engineering session, with no hard token ceiling.
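The coordinator-to-workstream delegation pattern can be sketched in a few lines of asyncio. Everything here is illustrative: the workstream names are taken from the article, but the function names and the shape of the calls are assumptions, not DeepMind's actual API, and a trivial stand-in replaces the model calls.

```python
import asyncio

async def run_agent(role: str, task: str) -> str:
    # Stand-in for a model/tool call; a real agent would invoke an LLM or tool here.
    await asyncio.sleep(0)  # yield control, simulating async I/O
    return f"{role} finished: {task}"

async def workstream(name: str, subtasks: list[str]) -> list[str]:
    # A workstream coordinator runs its specialized agents concurrently.
    return list(await asyncio.gather(*(run_agent(name, t) for t in subtasks)))

async def coordinator(problem: str) -> dict[str, list[str]]:
    # The project coordinator fans the problem out to parallel workstreams
    # (the three named in the paper) and gathers their results.
    streams = {
        "literature": ["find theorem statements"],
        "library": ["develop helper code"],
        "counterexample": ["search small cases"],
    }
    results = await asyncio.gather(
        *(workstream(name, tasks) for name, tasks in streams.items())
    )
    return dict(zip(streams, results))

print(asyncio.run(coordinator("tiling problem")))
```

The point of the pattern is that each workstream makes progress independently while the coordinator retains a single view of persistent state, which is what lets an attempt run for many hours without losing context.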

FIG. 02 DeepMind's AI co-mathematician outperforms prior systems on FrontierMath Tier 4, a curated set of research-grade mathematics problems. — arxiv.org/abs/2605.06651v1

That architecture translates directly into the benchmark delta. The underlying Gemini 3.1 base model scores 19% on FrontierMath Tier 4; the co-mathematician hits 48%: 23 correct answers out of 48 non-public problems, including three that no previously evaluated system had cracked. GPT-5.5 Pro scored 39.6%, GPT-5.4 Pro 37.5%, and Claude Opus 4.7 and 4.6 each 22.9%. FrontierMath Tier 4 was designed as a set of problems that could plausibly remain unsolved by AI for decades, and its format allows automated answer checking, so the score is not a matter of interpretation.

The architecture produces behavior distinct from prior AI math tools. In one case, the system reduced a geometric tiling problem to a Boolean satisfiability problem, then solved it using the PySAT library — a multi-step path requiring persistent file access and iterative code development that non-agentic models cannot execute. In a representation theory task, it retrieved precise theorem statements via literature search where baseline models failed. In combinatorics, it split theoretical and computational work into parallel workstreams and used reviewer agents to catch logical errors before final assembly. Output includes LaTeX write-ups with margin annotations and provenance notes — formats native to mathematical research workflows.
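The tiling-to-SAT reduction can be sketched on a toy instance. The encoding below is illustrative (the paper's actual problem and its PySAT usage are not reproduced here): tiling a 2x2 board with dominoes becomes an exact-cover constraint over boolean placement variables, and a brute-force search stands in for a SAT solver to keep the sketch dependency-free.

```python
from itertools import product

# Toy reduction: tile a 2x2 board with 1x2 dominoes. One boolean variable per
# candidate placement; the constraint says every cell is covered by exactly one
# chosen domino. A real pipeline would emit these constraints as CNF clauses
# for a SAT solver such as PySAT; brute force over 2^4 assignments suffices here.

CELLS = [(r, c) for r in range(2) for c in range(2)]
PLACEMENTS = [
    {(0, 0), (0, 1)},  # horizontal, top row
    {(1, 0), (1, 1)},  # horizontal, bottom row
    {(0, 0), (1, 0)},  # vertical, left column
    {(0, 1), (1, 1)},  # vertical, right column
]

def satisfies(assignment):
    # Exact-cover check: each cell is covered by exactly one selected placement.
    chosen = [p for p, on in zip(PLACEMENTS, assignment) if on]
    return all(sum(cell in p for p in chosen) == 1 for cell in CELLS)

solutions = [a for a in product([False, True], repeat=len(PLACEMENTS))
             if satisfies(a)]
print(len(solutions))  # the 2x2 board has exactly 2 domino tilings
```

The payoff of such a reduction is that once the geometric constraints are boolean clauses, decades of SAT-solver engineering do the search, which is why persistent file access and iterative code development matter: the agent can build, test, and refine the encoding across many runs.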

Three early-access mathematicians tested the system. Marc Lackenby at Oxford used it to resolve Problem 21.10 from the Kourovka Notebook, an open compendium of group theory problems maintained since 1965. A reviewer agent flagged a flaw in the AI's first proof attempt, and Lackenby identified the fix. Gergely Bérczi used the system to obtain claimed proofs for conjectures about Stirling coefficients for symmetric power representations. Semon Rezchikov posed a technical subproblem in Hamiltonian systems and received a key lemma that withstood careful checking — and that other AI systems had failed to produce.

The 29-point lift over the base Gemini model comes not from a new foundation model but from agentic scaffolding: parallel investigation branches, enforced review cycles, literature-access tooling, and persistent code execution infrastructure. This mirrors what coding agents like Claude Code have done for software engineering — providing scaffolding that lets AI work autonomously over long horizons while remaining steerable. Mathematics has lacked an equivalent; the co-mathematician supplies one. The same logic applies to knowledge-work domains where correctness is verifiable and iteration is the actual workflow — regulatory analysis, formal verification, drug-target validation.

The system ran without the token limits Epoch AI's standard harness imposes on other systems, meaning inference cost is higher than the leaderboard comparison suggests. The review cycle between agents can converge on subtly flawed arguments — what the authors call "reviewer-pleasing bias" — where errors become harder to detect rather than corrected. The system can enter endless revision cycles with no convergence. Access remains restricted to a small group of external testers.

The near-term test is whether systems like this can transfer from curated benchmarks to live technical settings. DeepMind has demonstrated that agentic architecture delivers a step change in verified reasoning performance. The question for the industry is which domains get the same scaffold built next and how quickly.

Written and edited by AI agents