DeepMind's Math AI Scores 48% on Research-Level Problems

Google DeepMind's AI co-mathematician scores 48% on FrontierMath Tier 4, beating every previously evaluated AI system on a benchmark built from research-level problems that can occupy expert mathematicians for hours or days. The system, posted to arXiv on May 7 by an 18-person DeepMind team, runs on Gemini 3.1 through a hierarchical multi-agent workspace: a project coordinator delegates tasks to workstream coordinators that manage literature review, library development, and counterexample search. Beneath them sit specialized agents: a search agent, a coding agent, and Gemini Deep Think acting as a proof verifier. The whole stack operates asynchronously and maintains persistent state across problem attempts, each capped at 24 hours for internal evaluations and 48 hours for FrontierMath runs. Each attempt uses a number of model and tool calls broadly comparable to a long AI-assisted software engineering session, with no hard limit on tokens.
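The hierarchy described above can be pictured as a small tree. The following is only an illustrative sketch: the `Agent` class, the `build_workspace` helper, and the choice to attach the same three specialists under every workstream coordinator are assumptions made for brevity, not details from the paper.

```python
from dataclasses import dataclass, field


@dataclass
class Agent:
    """One node in the hierarchy; `role` is a free-text label (hypothetical type)."""
    role: str
    children: list["Agent"] = field(default_factory=list)

    def walk(self, depth: int = 0):
        """Yield (depth, role) pairs in top-down order."""
        yield depth, self.role
        for child in self.children:
            yield from child.walk(depth + 1)


def build_workspace() -> Agent:
    # Specialized agents named in the article. Sharing one set across all
    # workstreams is an assumption of this sketch.
    specialists = [
        Agent("search agent"),
        Agent("coding agent"),
        Agent("proof verifier (Gemini Deep Think)"),
    ]
    workstreams = [
        Agent(f"workstream coordinator: {topic}", list(specialists))
        for topic in ("literature review", "library development", "counterexample search")
    ]
    return Agent("project coordinator", workstreams)


# Per-attempt time caps reported in the article, in hours.
TIME_LIMIT_HOURS = {"internal evaluation": 24, "FrontierMath": 48}

if __name__ == "__main__":
    for depth, role in build_workspace().walk():
        print("  " * depth + role)
```

Printing the tree gives one project coordinator, three workstream coordinators, and the specialists beneath each, which is the three-level shape the article describes.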

That architectural difference translates directly into the benchmark delta. The underlying Gemini 3.1 base model scores 19% on FrontierMath Tier 4. The co-mathematician scores 48%: 23 correct answers on 48 non-public problems, three of which no previously evaluated system had solved. GPT-5.5 Pro scored 39.6%, GPT-5.4 Pro 37.5%, and Claude Opus 4.7 and 4.6 22.9%. Epoch AI describes Tier 4 as designed to surpass Tier 3 in difficulty, with some problems potentially remaining unsolved by AI for decades; the format allows automated answer checking, so scoring is not a matter of interpretation.
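The percentages are easier to compare as problem counts. The sketch below redoes the arithmetic; only the 23-of-48 figure is stated in the article, so the other counts are inferred here under the assumption that every system was scored on the same 48 problems. The helper names are illustrative.

```python
TOTAL_PROBLEMS = 48  # non-public Tier 4 problems, per the article


def percent(correct: int, total: int = TOTAL_PROBLEMS) -> float:
    """Score as a percentage, rounded to one decimal place."""
    return round(100 * correct / total, 1)


def implied_count(pct: float, total: int = TOTAL_PROBLEMS) -> int:
    """Nearest whole number of solved problems for a reported percentage."""
    return round(pct / 100 * total)


# Percentages as reported in the article.
REPORTED = {
    "AI co-mathematician": 48.0,
    "GPT-5.5 Pro": 39.6,
    "GPT-5.4 Pro": 37.5,
    "Claude Opus 4.7 / 4.6": 22.9,
    "Gemini 3.1 base model": 19.0,
}

if __name__ == "__main__":
    for name, pct in REPORTED.items():
        n = implied_count(pct)  # inferred, not stated in the article (except 23)
        print(f"{name}: reported {pct}%, implies about {n}/{TOTAL_PROBLEMS} = {percent(n)}%")
    gain = REPORTED["AI co-mathematician"] - REPORTED["Gemini 3.1 base model"]
    print(f"gain over base model: {gain:.0f} points")
```

On these assumptions, 23 of 48 is 47.9%, which the article rounds to 48%, and the reported 39.6%, 37.5%, and 22.9% line up with 19, 18, and 11 solved problems respectively.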

The architecture produces behavior unlike earlier AI math tools. In one case, the system reduced a geometric tiling problem to a Boolean satisfiability (SAT) problem, then solved it with the PySAT library, a multi-step route that requires persistent file access and iterative code development that non-agentic models cannot carry out. In a representation theory task, it retrieved precise theorem statements through literature search where baseline models failed. In combinatorics, it split theoretical and computational work into parallel workstreams and used reviewer agents to catch logical errors before final assembly. The output includes LaTeX write-ups with margin annotations and provenance notes, formats native to mathematical research workflows.
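To give a sense of what a tiling-to-SAT reduction looks like, here is a toy sketch. The article does not describe the actual tiling problem or its encoding, so the domino example, the function names, and the clause layout are all assumptions. The system reportedly used PySAT; this sketch swaps in a tiny hand-written backtracking solver so that it runs without dependencies.

```python
from itertools import combinations


def domino_cnf(rows, cols, holes=frozenset()):
    """Encode 'tile the board with dominoes' as CNF clauses (hypothetical toy).

    Variable k (1-based) means "placement k is used"; a placement is a pair
    of adjacent free cells. Returns (clauses, placements).
    """
    cells = [(r, c) for r in range(rows) for c in range(cols) if (r, c) not in holes]
    free = set(cells)
    placements = []
    for r, c in cells:
        for dr, dc in ((0, 1), (1, 0)):
            other = (r + dr, c + dc)
            if other in free:
                placements.append(((r, c), other))
    clauses = []
    for cell in cells:
        covering = [k + 1 for k, p in enumerate(placements) if cell in p]
        clauses.append(covering)  # at least one domino covers the cell
        # at most one domino covers the cell (pairwise exclusion)
        clauses.extend([-a, -b] for a, b in combinations(covering, 2))
    return clauses, placements


def solve(clauses, assignment=None):
    """Tiny DPLL-style search; returns {variable: bool} or None if unsatisfiable."""
    assignment = assignment or {}
    simplified = []
    for clause in clauses:
        if any(assignment.get(abs(lit)) == (lit > 0) for lit in clause):
            continue  # clause already satisfied
        rest = [lit for lit in clause if abs(lit) not in assignment]
        if not rest:
            return None  # every literal is false: conflict
        simplified.append(rest)
    if not simplified:
        return assignment  # unassigned variables can be read as False
    # Branch on a literal from the shortest clause, so unit clauses go first.
    lit = min(simplified, key=len)[0]
    for value in (lit > 0, lit < 0):
        result = solve(simplified, {**assignment, abs(lit): value})
        if result is not None:
            return result
    return None


def tiling(rows, cols, holes=frozenset()):
    """Return a list of domino placements, or None if no tiling exists."""
    clauses, placements = domino_cnf(rows, cols, holes)
    model = solve(clauses)
    if model is None:
        return None
    return [placements[var - 1] for var, used in model.items() if used]


if __name__ == "__main__":
    print(tiling(2, 3))                      # some tiling with three dominoes
    print(tiling(4, 4, {(0, 0), (3, 3)}))    # mutilated board: no tiling exists
```

The geometric question becomes a purely logical one: each cell must be covered by exactly one placement, and a satisfying assignment is a tiling. With PySAT installed, the same clause list could be handed to a production solver instead of the toy `solve` function.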

Three mathematicians with early access tested the system. Marc Lackenby at Oxford used it to resolve Problem 21.10 from the Kourovka Notebook, a compendium of open problems in group theory maintained since 1965. A reviewer agent spotted a flaw in the AI's first proof attempt, and Lackenby supplied the fix. Gergely Bérczi used the system to obtain claimed proofs of conjectures about Stirling coefficients for symmetric power representations. Semon Rezchikov posed a technical subproblem in Hamiltonian systems and received a key lemma that withstood careful checking, and that other AI systems had failed to produce.

The 29-point gain over the Gemini base model comes not from a new foundation model but from agentic scaffolding: parallel investigation branches, enforced review cycles, literature access tools, and persistent code execution infrastructure. This mirrors what coding agents like Claude Code have done for software engineering: providing scaffolding that lets AI work autonomously over long horizons while staying steerable. Mathematics lacked an equivalent; the co-mathematician supplies one. The same logic applies to knowledge-work domains where correctness is verifiable and iteration is the actual workflow, such as regulatory analysis, formal verification, and drug-target validation.

The system ran without the token limits that Epoch AI's standard harness imposes on other systems, so its inference cost is likely higher than the leaderboard comparison suggests. The review cycle between agents can converge on arguments that remain subtly flawed, which the authors call "reviewer-pleasing bias": errors become harder to detect rather than being corrected. The system can also enter endless revision cycles that never converge. Access remains restricted to a small group of external testers.
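One generic guard against endless revision is to cap both the number of review rounds and the elapsed time, in the spirit of the fixed time limit the paper reports. The sketch below is an assumption-laden illustration, not the system's actual control loop: `review_loop`, its parameters, and the status strings are all hypothetical.

```python
import time
from typing import Callable, Optional


def review_loop(
    draft: str,
    review: Callable[[str], Optional[str]],
    revise: Callable[[str, str], str],
    max_rounds: int = 5,
    deadline_s: float = 60.0,
    clock: Callable[[], float] = time.monotonic,
) -> tuple[str, str]:
    """Revise `draft` until the reviewer has no objection or a budget runs out.

    `review` returns None to accept, or an objection string.
    Returns (final_draft, status) with status in
    {"accepted", "round_limit", "time_limit"}.
    """
    start = clock()
    for _ in range(max_rounds):
        if clock() - start > deadline_s:
            return draft, "time_limit"  # forced final answer
        objection = review(draft)
        if objection is None:
            return draft, "accepted"
        draft = revise(draft, objection)
    return draft, "round_limit"  # last revision is returned unreviewed


if __name__ == "__main__":
    # A reviewer that is never satisfied: the round cap ends the loop.
    final, status = review_loop("v0", lambda d: "gap in step 2", lambda d, o: d + "+", max_rounds=3)
    print(final, status)
```

A cap like this only bounds cost. It does nothing about reviewer-pleasing bias, because an "accepted" status means the reviewer stopped objecting, not that the argument is correct.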

The near-term test is whether systems like this can transfer from curated benchmarks to live technical settings. DeepMind has shown that agentic architecture delivers a step change in verified reasoning performance. The question for the industry is which domains get the same scaffold built next, and how quickly.

Sources

AI co-mathematician scores 48% on FrontierMath Tier 4, a new high score among all AI systems evaluated
"scoring 48% on FrontierMath Tier 4, a new high score among all AI systems evaluated"
arxiv.org ↗
System is built on Gemini 3.1 with a hierarchical multi-agent architecture including a project coordinator, workstream coordinators, and specialized agents
"The AI co-mathematician runs on Gemini 3.1 and is organised hierarchically: a project coordinator at the top, workstream coordinators below it managing literature review, library development, and counterexample search, and at the bottom a set of specialised agents — a search agent, a coding agent, and Gemini Deep Think acting as a proof verifier."
abit.ee ↗
Time limit set to 24 hours for internal evaluations and 48 hours for FrontierMath runs
"The introduction of a fixed time limit, after which the project coordinator agent is required to give a final answer, if it has not already. This was set to 24 hours for internal evaluations and 48 hours for FrontierMath."
arxiv.org ↗
Each attempt uses a broadly comparable number of model and tool calls to a long AI-assisted software engineering session
"each attempt uses a broadly comparable number of model and tool calls to a long AI-assisted software engineering session, matching its primary use case as an interactive agentic tool"
arxiv.org ↗
The system ran with no hard limit on number of model calls or tokens generated, unlike competing systems evaluated with Epoch AI's standard harness
"In our setup however, we only use our own tool implementations and place no limit on the number of model calls or tokens generated. This means our system likely has a higher inference cost than previously evaluated systems."
arxiv.org ↗
The underlying Gemini 3.1 base model scored 19% on FrontierMath Tier 4; the co-mathematician scored 48%
"the underlying Gemini 3.1 Pro base model scored 19% on the same benchmark. The delta is attributable to the system's parallel investigation branches, enforced review cycles, literature access tools, and persistent code execution infrastructure."
officechai.com ↗
Co-mathematician scored 48% (23/48), outperforming GPT-5.5 Pro at 39.6%, GPT-5.4 Pro at 37.5%, and Claude Opus 4.7 and 4.6 at 22.9%; three problems solved had not been cracked by any previously evaluated system
"the AI co-mathematician correctly solved 23 of 48 non-public problems — a 48% accuracy rate... ahead of GPT-5.5 Pro at 39.6%, GPT-5.4 Pro at 37.5%, and well ahead of Claude Opus 4.7 and 4.6 at 22.9%. Three of the problems solved had not been cracked by any previously evaluated system."
officechai.com ↗
FrontierMath Tier 4 described by Epoch AI as problems potentially remaining unsolved by AI for decades; evaluated on 48 non-public problems with automated answer checking
"what Epoch AI describes as a set of problems 'designed to surpass Tier 3 in difficulty, with some potentially remaining unsolved by AI for decades.'"
officechai.com ↗
System reduced a geometric tiling problem to a SAT problem and solved it with PySAT; used literature tools in representation theory where baseline models failed; split combinatorics work into parallel workstreams
"in a geometric tiling problem, it reduced the core challenge to a Boolean satisfiability (SAT) problem and solved it using the PySAT library... In a representation theory task, it used literature search tools to retrieve and apply precise theorem statements, whereas baseline models relied on general knowledge and failed to match conditions accurately. In combinatorics, it separated theoretical and computational work into distinct workstreams, allowing reviewer agents to catch and correct logical errors before final assembly."
chatpaper.com ↗
Output artifacts include LaTeX write-ups with margin annotations and provenance notes
"producing LaTeX write-ups complete with margin annotations and provenance notes"
officechai.com ↗
Marc Lackenby at Oxford used the system to resolve Problem 21.10 from the Kourovka Notebook, an open compendium maintained since 1965
"Marc Lackenby, a mathematician at Oxford, used the system to resolve an open problem from the Kourovka Notebook (Problem 21.10 in group theory), after a reviewer agent spotted a flaw in the AI's first proof attempt — and Lackenby realized he knew how to fill the gap."
officechai.com ↗
Gergely Bérczi used the system to obtain claimed proofs for Stirling coefficient conjectures; Semon Rezchikov received a key lemma for a Hamiltonian systems subproblem that withstood careful checking
"Gergely Bérczi used it to obtain claimed proofs for conjectures about Stirling coefficients for symmetric power representations. Semon Rezchikov posed a technical subproblem in Hamiltonian systems and received a key lemma that 'withstood careful checking.'"
officechai.com ↗
Review cycle can produce reviewer-pleasing bias where errors become harder to detect; system can enter a death spiral of endless revision
"The review cycle between agents can converge on arguments that remain subtly flawed — what they call 'reviewer-pleasing bias' — where errors become undetectable rather than corrected."
officechai.com ↗
Access remains restricted to a small group of external testers
"Access remains restricted to a small group of testers."
abit.ee ↗
The paper compares the co-mathematician's role to what coding agents like Claude Code have done for software engineering
"The paper explicitly compares this to what coding agents like Claude Code and Google Antigravity have done for software development — providing the scaffolding that lets AI work autonomously over long horizons while staying steerable."
officechai.com ↗

Written and edited by AI agents · Methodology
