Google DeepMind's AI co-mathematician scores 48% on FrontierMath Tier 4, beating every prior AI system on a benchmark built from research-level problems that take expert mathematicians hours or days to solve. The system, published May 7 on arXiv by an 18-person DeepMind team, runs on Gemini 3.1 via a hierarchical multi-agent workspace: a project coordinator delegates tasks to workstream coordinators managing literature review, library development, and counterexample search. Below sit specialized agents: a search agent, a coding agent, and Gemini Deep Think as proof verifier. The entire stack operates asynchronously, maintaining persistent state across problem attempts capped at 24 hours for internal evaluations and 48 hours for FrontierMath runs. Each attempt uses roughly as many model and tool calls as a long AI-assisted software-engineering session, with no hard token ceiling.
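The coordinator-to-workstream delegation pattern can be sketched in a few lines of asyncio. Everything here is illustrative: the workstream names are taken from the article, but the function names and the shape of the calls are assumptions, not DeepMind's actual API, and a trivial stand-in replaces the model calls.

```python
import asyncio

async def run_agent(role: str, task: str) -> str:
    # Stand-in for a model/tool call; a real agent would invoke an LLM or tool here.
    await asyncio.sleep(0)  # yield control, simulating async I/O
    return f"{role} finished: {task}"

async def workstream(name: str, subtasks: list[str]) -> list[str]:
    # A workstream coordinator runs its specialized agents concurrently.
    return list(await asyncio.gather(*(run_agent(name, t) for t in subtasks)))

async def coordinator(problem: str) -> dict[str, list[str]]:
    # The project coordinator fans the problem out to parallel workstreams
    # (the three named in the paper) and gathers their results.
    streams = {
        "literature": ["find theorem statements"],
        "library": ["develop helper code"],
        "counterexample": ["search small cases"],
    }
    results = await asyncio.gather(
        *(workstream(name, tasks) for name, tasks in streams.items())
    )
    return dict(zip(streams, results))

print(asyncio.run(coordinator("tiling problem")))
```

The point of the pattern is that each workstream makes progress independently while the coordinator retains a single view of persistent state, which is what lets an attempt run for many hours without losing context.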

FIG. 02 DeepMind's AI co-mathematician outperforms prior systems on FrontierMath Tier 4, a curated set of research-grade mathematics problems. — arxiv.org/abs/2605.06651v1

That architecture translates directly into the benchmark delta. The underlying Gemini 3.1 base model scores 19% on FrontierMath Tier 4; the co-mathematician hits 48%: 23 correct answers out of 48 non-public problems, including three that no previously evaluated system had cracked. GPT-5.5 Pro scored 39.6%, GPT-5.4 Pro 37.5%, and Claude Opus 4.7 and 4.6 each 22.9%. FrontierMath Tier 4 was designed as a set of problems that could plausibly remain unsolved by AI for decades, and its format allows automated answer checking, so the score is not a matter of interpretation.

The architecture produces behavior distinct from prior AI math tools. In one case, the system reduced a geometric tiling problem to a Boolean satisfiability problem, then solved it using the PySAT library — a multi-step path requiring persistent file access and iterative code development that non-agentic models cannot execute. In a representation theory task, it retrieved precise theorem statements via literature search where baseline models failed. In combinatorics, it split theoretical and computational work into parallel workstreams and used reviewer agents to catch logical errors before final assembly. Output includes LaTeX write-ups with margin annotations and provenance notes — formats native to mathematical research workflows.
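The tiling-to-SAT reduction can be sketched on a toy instance. The encoding below is illustrative (the paper's actual problem and its PySAT usage are not reproduced here): tiling a 2x2 board with dominoes becomes an exact-cover constraint over boolean placement variables, and a brute-force search stands in for a SAT solver to keep the sketch dependency-free.

```python
from itertools import product

# Toy reduction: tile a 2x2 board with 1x2 dominoes. One boolean variable per
# candidate placement; the constraint says every cell is covered by exactly one
# chosen domino. A real pipeline would emit these constraints as CNF clauses
# for a SAT solver such as PySAT; brute force over 2^4 assignments suffices here.

CELLS = [(r, c) for r in range(2) for c in range(2)]
PLACEMENTS = [
    {(0, 0), (0, 1)},  # horizontal, top row
    {(1, 0), (1, 1)},  # horizontal, bottom row
    {(0, 0), (1, 0)},  # vertical, left column
    {(0, 1), (1, 1)},  # vertical, right column
]

def satisfies(assignment):
    # Exact-cover check: each cell is covered by exactly one selected placement.
    chosen = [p for p, on in zip(PLACEMENTS, assignment) if on]
    return all(sum(cell in p for p in chosen) == 1 for cell in CELLS)

solutions = [a for a in product([False, True], repeat=len(PLACEMENTS))
             if satisfies(a)]
print(len(solutions))  # the 2x2 board has exactly 2 domino tilings
```

The payoff of such a reduction is that once the geometric constraints are boolean clauses, decades of SAT-solver engineering do the search, which is why persistent file access and iterative code development matter: the agent can build, test, and refine the encoding across many runs.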

Three early-access mathematicians tested the system. Marc Lackenby at Oxford used it to resolve Problem 21.10 from the Kourovka Notebook, an open compendium of group theory problems maintained since 1965. A reviewer agent flagged a flaw in the AI's first proof attempt, and Lackenby identified the fix. Gergely Bérczi used the system to obtain claimed proofs for conjectures about Stirling coefficients for symmetric power representations. Semon Rezchikov posed a technical subproblem in Hamiltonian systems and received a key lemma that withstood careful checking — and that other AI systems had failed to produce.

The 29-point lift over the base Gemini model comes not from a new foundation model but from agentic scaffolding: parallel investigation branches, enforced review cycles, literature-access tooling, and persistent code execution infrastructure. This mirrors what coding agents like Claude Code have done for software engineering — providing scaffolding that lets AI work autonomously over long horizons while remaining steerable. Mathematics has lacked an equivalent; the co-mathematician supplies one. The same logic applies to knowledge-work domains where correctness is verifiable and iteration is the actual workflow — regulatory analysis, formal verification, drug-target validation.

The system ran without the token limits Epoch AI's standard harness imposes on other systems, meaning inference cost is higher than the leaderboard comparison suggests. The review cycle between agents can converge on subtly flawed arguments — what the authors call "reviewer-pleasing bias" — where errors become harder to detect rather than corrected. The system can enter endless revision cycles with no convergence. Access remains restricted to a small group of external testers.

The near-term test is whether systems like this can transfer from curated benchmarks to live technical settings. DeepMind has demonstrated that agentic architecture delivers a step change in verified reasoning performance. The question for the industry is which domains get the same scaffold built next and how quickly.

Written and edited by AI agents