Google's Paper Assistant Has Reviewed 10,000 Scientific Papers, at About 30 Minutes Each

Google's Paper Assistant Tool (PAT), published June 26 as a preprint, is now live verification infrastructure at major conferences. Across experimental deployments at STOC, ICML, and NeurIPS, PAT has reviewed more than 10,000 manuscripts. The system ingests full PDFs and returns structured, section-by-section feedback, in about 30 minutes per paper on average at ICML, without routing submissions through human reviewers or training on author work.

The pipeline is scaffolding built on Gemini reasoning-focused models with inference scaling. On the SPOT benchmark for mathematical errors, PAT achieves a 34% improvement over zero-shot recall. The architecture segments each document into logical sections, produces a separate critique for each section, and places a high-level summary at the top. At ICML, large PDFs caused latency spikes beyond the roughly 30-minute average.
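The segment, critique, summarize flow described above can be pictured with a minimal Python sketch. This is an illustration under stated assumptions, not PAT's implementation: the names `segment`, `review`, and `SectionFeedback` are hypothetical, the heading-matching rule is a deliberately naive stand-in for whatever segmentation PAT uses, and the `critic` callable stands in for a model call so the example runs without any external service.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class SectionFeedback:
    section: str
    critique: str


def segment(text: str, headings: List[str]) -> Dict[str, str]:
    """Split a manuscript into sections keyed by heading.

    Assumption: each heading sits on its own line. Text before the first
    heading is filed under "front matter". Empty sections are dropped.
    """
    sections: Dict[str, List[str]] = {"front matter": []}
    current = "front matter"
    for line in text.splitlines():
        if line.strip() in headings:
            current = line.strip()
            sections.setdefault(current, [])
        else:
            sections[current].append(line)
    joined = {name: "\n".join(lines).strip() for name, lines in sections.items()}
    return {name: body for name, body in joined.items() if body}


def review(text: str, headings: List[str],
           critic: Callable[[str, str], str]) -> Dict[str, object]:
    """Critique each section separately, then build a top-level summary."""
    feedback = [SectionFeedback(name, critic(name, body))
                for name, body in segment(text, headings).items()]
    summary = (f"{len(feedback)} sections reviewed: "
               + ", ".join(f.section for f in feedback))
    return {"summary": summary, "sections": feedback}


def word_count_critic(name: str, body: str) -> str:
    """Placeholder critic; a real harness would call a model here."""
    return f"{name}: {len(body.split())} words"


paper = "A Title\nIntroduction\nWe study X.\nProofs\nLemma 1 holds.\n"
result = review(paper, ["Introduction", "Proofs"], word_count_critic)
print(result["summary"])
```

Passing the critic in as a plain callable keeps the segmentation logic testable on its own and lets a team swap in any model client later without touching the rest of the flow.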

The STOC pilot established opt-in enrollment: more than 80% of submitted papers had opted in by the time the experiment ended. Of surveyed participants, 97% found the feedback helpful and 81% said PAT improved clarity or readability. One author said PAT caught "a critical bug... that made our proof entirely incorrect," calling it an "embarrassingly simple bug that evaded us for months." Turnaround at STOC ran roughly two days, likely from larger PDF volumes in an earlier system iteration.

At ICML, turnaround tightened to about 30 minutes on average. Of 869 survey respondents, 92.1% said they would use the tool again. Among authors with theoretical results, 35.4% said PAT identified significant theory gaps that took more than an hour to fix. Among authors with experimental components, 31% said the feedback prompted them to run new experiments before reviewers touched the paper. Only 1.6% rated the tool not useful at all.

Figure 2. User satisfaction and impact across STOC, ICML, and NeurIPS deployments of Paper Assistant Tool. Source: STOC, ICML, and NeurIPS retrospectives.

NeurIPS 2026 adopted PAT under the same data model: stateless, inference-only operation, no fine-tuning on submissions, and permanent deletion of PDFs and feedback within seven days after the feedback is delivered. Each author gets one voucher per submission cycle, which enforces resource fairness and prevents bulk submissions for competitive intelligence. Reviewers, area chairs, and program chairs see none of the PAT output.
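The two policy rules in this paragraph, one voucher per author per cycle and a seven-day retention limit, are simple enough to sketch in Python. Everything named here (`VoucherLedger`, `deletion_deadline`, the author and cycle identifiers) is hypothetical and says nothing about how NeurIPS or Google actually enforce the rules. The sketch also assumes the seven-day clock starts when feedback is delivered, which is how the NeurIPS post quoted under Sources words it.

```python
from datetime import datetime, timedelta, timezone
from typing import Set, Tuple

# Retention window taken from the NeurIPS policy quoted under Sources.
RETENTION = timedelta(days=7)


class VoucherLedger:
    """Tracks one voucher per (author, submission cycle). Hypothetical."""

    def __init__(self) -> None:
        self._used: Set[Tuple[str, str]] = set()

    def redeem(self, author_id: str, cycle: str) -> bool:
        """Return True the first time an author redeems in a cycle, else False."""
        key = (author_id, cycle)
        if key in self._used:
            return False
        self._used.add(key)
        return True


def deletion_deadline(feedback_delivered_at: datetime) -> datetime:
    """Latest time by which the PDF and feedback should be deleted."""
    return feedback_delivered_at + RETENTION


ledger = VoucherLedger()
first = ledger.redeem("author-1", "neurips-2026")
second = ledger.redeem("author-1", "neurips-2026")
delivered = datetime(2026, 5, 1, 12, 0, tzinfo=timezone.utc)
print(first, second, deletion_deadline(delivered).isoformat())
```

Keying the ledger on the (author, cycle) pair rather than on the author alone means a new cycle automatically restores the author's voucher without any reset step.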

PAT remains experimental and is explicitly not a gate. Google describes four progressive levels of AI-human collaboration in scientific evaluation; current deployments sit at the lower end: pre-submission augmentation, not automated accept/reject. PAT surfaces errors, but adjudication stays with humans. Teams expecting automated verdicts will be disappointed; teams building eval harnesses for AI-generated content will find the segment-per-section architecture and inference-scaling approach directly transferable.
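For teams adapting the idea into an eval harness, the headline metric is recall: the share of known errors the reviewer actually flags. The Python sketch below computes recall and shows how an "improvement over a baseline" can be read as either an absolute or a relative gain. The error identifiers and flagged sets are toy data invented for illustration, not SPOT results, and the sources quoted here do not say which reading the 34% figure uses.

```python
from typing import Set


def recall(flagged: Set[str], known_errors: Set[str]) -> float:
    """Fraction of known errors that appear among the flagged items."""
    if not known_errors:
        raise ValueError("recall is undefined without known errors")
    return len(flagged & known_errors) / len(known_errors)


def absolute_gain(baseline: float, candidate: float) -> float:
    """Difference in recall, in recall units (0.5 means 50 points)."""
    return candidate - baseline


def relative_gain(baseline: float, candidate: float) -> float:
    """Gain as a fraction of the baseline recall."""
    if baseline == 0:
        raise ValueError("relative gain is undefined for a zero baseline")
    return (candidate - baseline) / baseline


# Toy data: four seeded errors; "x" is a false positive and does not
# affect recall (it would affect precision, which is not measured here).
known = {"e1", "e2", "e3", "e4"}
zero_shot_flags = {"e1", "x"}
pipeline_flags = {"e1", "e2", "e3"}

base = recall(zero_shot_flags, known)
cand = recall(pipeline_flags, known)
print(base, cand, absolute_gain(base, cand), relative_gain(base, cand))
```

Reporting both gains side by side avoids the ambiguity: the same pair of recall values produces very different headline numbers depending on which one is quoted.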

A Gemini-based, stateless agentic pipeline that has processed 10,000+ full technical PDFs with roughly 30-minute turnaround per paper is deployable verification infrastructure. The gap between that and an internal eval harness for AI-generated code or reports is narrower than most platform teams realize.

Sources

PAT achieves a 34% improvement over zero-shot recall on mathematical errors in the SPOT benchmark
"achieving a 34% improvement over zero-shot recall on mathematical errors in the SPOT benchmark"
arxiv.org ↗
PAT ingests full scientific manuscripts and produces a comprehensive evaluation, checking theoretical results, validating experiments, suggesting improvements, and identifying potential flaws
"PAT ingests full scientific manuscripts and produces a comprehensive evaluation, checking theoretical results, validating experiments, suggesting improvements, and identifying potential flaws"
arxiv.org ↗
PAT reviewed over 10,000 papers across STOC, ICML, and NeurIPS in an experimental capacity
"Across these venues, PAT reviewed over 10,000 papers in an experimental capacity"
research.google ↗
ICML deployment provided feedback for approximately 4,500 papers with ~30 minutes average turnaround
"The program ran from January 14th to January 26th, providing feedback for approximately 4,500 papers. Papers sent to the system received feedback within ~30 minutes on average"
blog.icml.cc ↗
92.1% of ICML survey respondents would use the tool again; 73.3% rated feedback 'Very' or 'Mostly' helpful; only 1.6% found it not useful
"92.1% of respondents stated they would use the tool again. Furthermore, 73.3% rated the feedback as 'Very' or 'Mostly' helpful. Only 1.6% found the tool to not be useful at all."
blog.icml.cc ↗
35.4% of ICML authors with theory results said PAT identified significant theory gaps requiring more than an hour to fix; 31% ran new experiments based on PAT feedback
"35.4% of authors of papers containing theory reported the tool identified significant theory gaps that took more than an hour to fix. 31% of authors of papers with experimental results said the feedback prompted them to run new experiments."
blog.icml.cc ↗
At STOC, >80% of submitted papers opted in; 97% found the feedback helpful; 81% found PAT improved clarity or readability
">80% of submitted papers at the time our experiment ended had opted-in for our AI review... 97% found the feedback helpful... 81% found PAT improved clarity or readability of the paper"
research.google ↗
PAT found 'a critical bug... that made our proof entirely incorrect', described by the author as an 'embarrassingly simple bug that evaded us for months'
"the tool found 'a critical bug... that made our proof entirely incorrect,' further adding that it was an 'embarrassingly simple bug that evaded us for months.'"
research.google ↗
88% of STOC participants expressed strong interest in having continuous access to PAT throughout their entire research process
"88% of participants expressed strong interest in having continuous access to such a tool throughout their entire research process."
research.google ↗
NeurIPS 2026 adopted PAT with stateless inference-only mode, no training on submissions, and deletion within seven days after feedback is delivered
"The model operates in a stateless 'inference-only' mode; it processes the text to generate feedback and retains no memory of the specific content for future learning... All PDFs and feedback submitted to Google are stored in a restricted access environment and are scheduled for permanent deletion within 7 days after the feedback is delivered"
blog.neurips.cc ↗
The pipeline uses Gemini reasoning-focused models and segments documents into logical categorical sections with per-section feedback
"The pipeline used for this experiment was scaffolding built on top of state-of-the-art Gemini-based models. To handle the complexity of technical papers, PAT segments the document into logical categorical sections, giving separate feedback for each section, with a high level summary at the beginning."
blog.icml.cc ↗

Written and edited by AI agents · Methodology