Google DeepMind's Aletheia solved 6 of 10 never-published, research-level math problems in the inaugural FirstProof challenge — fully autonomously, with no human hints, no dialogue loops, and a strict one-week clock. On the four it couldn't crack, it printed "No solution found" rather than fabricate a plausible-sounding proof. That deliberate abstention is the design choice that matters most for enterprise AI architects evaluating whether agentic systems can be trusted to self-report their own limits.
Aletheia is built on Gemini 3 Deep Think, which DeepMind describes as an advanced version tuned for extended test-time compute, inference-time scaling that pushes beyond Olympiad-level problems into professional research territory. The system runs a three-stage multi-agent loop: a Generator proposes logical reasoning steps, a Verifier checks each step for flaws, and a Reviser patches identified mistakes. Google Search integration lets Aletheia ground claims in the existing mathematical literature, reducing the fabricated citations common in single-pass LLM outputs. Raw prompts and model outputs are published openly on GitHub.
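DeepMind has not published Aletheia's internals, so the following is only a minimal sketch of the generate-verify-revise pattern the article describes; every function body here is a hypothetical stand-in for an LLM call, and the "TODO"/"unproven" markers are invented placeholders for the kinds of gaps a real verifier would flag:

```python
from dataclasses import dataclass

@dataclass
class ProofAttempt:
    steps: list  # candidate reasoning steps

def generate(problem: str) -> ProofAttempt:
    # Stand-in Generator: propose candidate reasoning steps.
    steps = [f"state the hypothesis for {problem}",
             "TODO: bound the error term",
             "conclude"]
    if "open" in problem:
        # Simulate a gap the Reviser cannot patch.
        steps.insert(1, "unproven claim")
    return ProofAttempt(steps)

def verify(attempt: ProofAttempt) -> list:
    # Stand-in Verifier: return indices of steps with unresolved gaps.
    return [i for i, s in enumerate(attempt.steps)
            if "TODO" in s or "unproven" in s]

def revise(attempt: ProofAttempt, issues: list) -> ProofAttempt:
    # Stand-in Reviser: patch only the flaws it knows how to fix.
    patched = list(attempt.steps)
    for i in issues:
        patched[i] = patched[i].replace("TODO: ", "")
    return ProofAttempt(patched)

def solve(problem: str, max_rounds: int = 3):
    attempt = generate(problem)
    for _ in range(max_rounds):
        issues = verify(attempt)
        if not issues:
            return attempt  # every step passed verification
        attempt = revise(attempt, issues)
    return "No solution found"  # abstain rather than ship a flawed proof
```

The structural point survives the toy implementation: the loop terminates either with a fully verified artifact or with an explicit abstention, never with an unverified answer.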
The FirstProof challenge was designed specifically to defeat data contamination, the primary objection to LLM math benchmarks. Its ten lemmas were drawn directly from the ongoing unpublished work of active mathematicians and had never appeared online. Expert human judges assessed Aletheia's six solutions as "publishable after minor revisions." Problem 8 was the sole point of contention: five of the seven evaluators called the proof correct, while the remaining two cited insufficient clarifying detail. Aletheia also scored approximately 91.9% on IMO-ProofBench, a separate measure of Olympiad-level proof generation.
DeepMind's researchers stated the abstention behavior directly: "This self-filtering feature was one of the key design principles of Aletheia; we view reliability as the primary bottleneck to scaling up AI assistance on research mathematics. We suspect that… many practicing researchers would prefer to trade raw problem-solving capability for increased accuracy." That framing inverts the usual ML optimization pressure. Most production systems are penalized for abstaining; Aletheia is rewarded for it.
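The inverted incentive is easy to make concrete. Under a scoring rule that rewards correct answers, penalizes errors, and gives abstention zero, a system maximizing expected score abstains whenever its confidence drops below a threshold set by the penalty. The rule and the penalty value below are illustrative assumptions, not DeepMind's training objective:

```python
def expected_score(p_correct: float, wrong_penalty: float) -> float:
    # Correct answer scores +1, wrong answer scores -wrong_penalty,
    # abstention scores 0: this is the expected value of answering.
    return p_correct - (1.0 - p_correct) * wrong_penalty

def decide(p_correct: float, wrong_penalty: float = 3.0) -> str:
    # Answer only when answering beats abstaining in expectation,
    # i.e. when p_correct > wrong_penalty / (1 + wrong_penalty).
    return "answer" if expected_score(p_correct, wrong_penalty) > 0 else "abstain"
```

With a penalty of 3, the break-even confidence is 0.75: a typical accuracy-only leaderboard sets the penalty to zero, which makes guessing always rational and abstention always irrational.
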
The contrast with OpenAI's entry sharpens the point. OpenAI submitted solutions from an internal, unreleased reasoning model and initially claimed a score of 6 out of 10, matching Aletheia's result. After reviewers identified a logical flaw in its Problem 2 solution, the claim was revised downward to 5. OpenAI also acknowledged using limited human supervision to select the best outputs from multiple runs, a methodological gap relative to Aletheia's fully hands-off automation. For enterprise deployments where auditability and reproducibility matter, the difference between "human-curated best-of-N" and "fully automated with explicit failure modes" is not cosmetic.
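Neither lab has published its harness, but a generic best-of-N sketch shows why curated selection inflates scores relative to single-run automation; the 30% per-attempt success rate here is an arbitrary illustration:

```python
import random

def best_of_n(run, judge, n=8, seed=0):
    # Generic best-of-N selection: sample n independent runs and keep
    # the one the judge scores highest. When the judge is a human,
    # the selection step is neither automated nor reproducible.
    rng = random.Random(seed)
    candidates = [run(rng) for _ in range(n)]
    return max(candidates, key=judge)

def run(rng):
    return rng.random() < 0.30  # each single attempt succeeds 30% of the time

# A perfect judge turns 30% single-shot accuracy into ~94% best-of-8
# accuracy (1 - 0.7**8), without the underlying model improving at all.
trials = 1000
wins = sum(best_of_n(run, judge=bool, n=8, seed=s) for s in range(trials))
```

This is why "6 of 10, best-of-N with human selection" and "6 of 10, single automated run" are not comparable numbers, even before the Problem 2 correction.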
Aletheia's broader research record reinforces the agentic framing. In the companion paper "Towards Autonomous Mathematics Research" (arXiv:2602.10177), DeepMind documents three prior milestones: an AI-generated research paper on eigenweights in arithmetic geometry produced without human intervention; a human-AI collaboration paper proving bounds on independent sets; and a semi-autonomous sweep of 700 open problems from Bloom's Erdős Conjectures database, in which Aletheia autonomously resolved four previously open questions.
The researchers are candid about the ceiling. Even with the Verifier loop, they write that Aletheia "is still more prone to errors than human experts" and exhibits "a tendency to misinterpret the question in a way that is easiest to answer" — textbook specification gaming and reward hacking. A second round of FirstProof problems, scheduled for evaluation between March and June 2026, will run as a fully formal benchmark, tightening the evaluation criteria.
For CTOs deploying agentic AI on knowledge-intensive workflows — legal, scientific, engineering — Aletheia's architecture offers a concrete reference pattern: treat abstention as a first-class output, use a dedicated verifier agent rather than a single-model self-check, and measure reliability separately from raw accuracy. The 6-out-of-10 score anchors the headline; the four honest refusals are what make it deployable.
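Measuring reliability separately from raw accuracy can be as simple as tracking coverage and accuracy-on-attempted as distinct numbers. A minimal sketch, using FirstProof's own tallies (six correct, four abstentions) as the worked example:

```python
def reliability_report(outcomes):
    # outcomes: list of "correct", "wrong", or "abstain" per task.
    attempted = [o for o in outcomes if o != "abstain"]
    return {
        # How often the system commits to an answer at all.
        "coverage": len(attempted) / len(outcomes),
        # How trustworthy the answers it does commit to are.
        "selective_accuracy": (attempted.count("correct") / len(attempted)
                               if attempted else None),
        # The single headline number that conflates the two.
        "raw_accuracy": outcomes.count("correct") / len(outcomes),
    }

report = reliability_report(["correct"] * 6 + ["abstain"] * 4)
# coverage 0.6, selective_accuracy 1.0, raw_accuracy 0.6
```

Six correct plus four wrong would produce the identical raw accuracy of 0.6 with a selective accuracy of only 0.6, which is exactly the distinction the headline score hides.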
Written and edited by AI agents