Google's Paper Assistant Tool (PAT), published June 26 as a preprint, is now live verification infrastructure at major conferences. After pilot deployments at STOC and ICML 2026, PAT has reviewed more than 10,000 manuscripts. The system ingests full PDFs and returns structured, section-by-section feedback in about 30 minutes per paper without routing submissions through human reviewers or training on author work.
The pipeline runs on Gemini reasoning-focused models with inference scaling. On the SPOT benchmark for mathematical errors, PAT achieves a 34% improvement in recall over zero-shot prompting. The architecture segments each document into logical sections, produces a critique for each section, then surfaces a summary. At ICML, large PDFs caused latency spikes beyond the 30-minute median.
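The segment, critique, summarize flow described above can be sketched in a few lines of Python. This is a minimal illustration, not PAT's implementation: the heading heuristic, the `SectionCritique` type, and the `toy_critic` function (a stand-in for a model call) are all hypothetical names introduced here.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

# Hypothetical heuristic: a section starts at a line like "2 Proof" or "3.1 Setup".
HEADING = re.compile(r"^(\d+(?:\.\d+)*[ \t]+[A-Z].*)$", re.MULTILINE)


@dataclass
class SectionCritique:
    title: str
    issues: List[str] = field(default_factory=list)


def split_sections(text: str) -> List[Tuple[str, str]]:
    """Cut the text into (heading, body) pairs at each detected heading."""
    matches = list(HEADING.finditer(text))
    sections = []
    for i, m in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        sections.append((m.group(1).strip(), text[m.end():end].strip()))
    return sections


def review(text: str, critic: Callable[[str, str], List[str]]) -> Dict[str, object]:
    """Critique each section independently, then roll the results into a summary."""
    critiques = [SectionCritique(t, critic(t, b)) for t, b in split_sections(text)]
    flagged = [c for c in critiques if c.issues]
    total = sum(len(c.issues) for c in flagged)
    summary = f"{total} issue(s) in {len(flagged)} of {len(critiques)} section(s)"
    return {"sections": critiques, "summary": summary}


def toy_critic(title: str, body: str) -> List[str]:
    """Assumption: a trivial rule-based stand-in for a per-section model call."""
    issues = []
    if "TODO" in body:
        issues.append("unresolved TODO")
    if "clearly" in body.lower():
        issues.append("claim asserted as 'clearly' without justification")
    return issues


paper = """1 Introduction
We study X. Clearly the bound holds.
2 Proof
TODO: finish the induction step.
3 Experiments
Results are in Table 1.
"""

report = review(paper, toy_critic)
print(report["summary"])  # 2 issue(s) in 2 of 3 section(s)
```

Keeping the critic as a plain callable means the same harness could wrap any per-section checker, which is the property that makes the structure reusable for other kinds of generated documents.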
The STOC pilot used opt-in enrollment, and authors of 80% of submitted papers chose to enroll. Of surveyed participants, 97% found the feedback helpful and 81% said PAT improved clarity or readability. One author said PAT caught "a critical bug that made our proof entirely incorrect... an embarrassingly simple bug that evaded us for months." Turnaround at STOC ran roughly two days, likely because an earlier iteration of the system was handling larger PDF volumes.
At ICML, turnaround tightened to 30 minutes. Of 869 survey respondents, 92.1% said they would use the tool again. Among authors with theoretical results, 35.4% said PAT identified significant theory gaps requiring more than an hour to fix. Among authors with experimental components, 31% ran new experiments in response to PAT feedback before reviewers touched the paper. Only 1.6% rated the tool not useful.
NeurIPS 2026 adopted PAT under the same data model: stateless inference-only, no fine-tuning on submissions, deletion within seven days post-program. Each author gets one voucher per submission cycle, enforcing resource fairness and preventing bulk submissions for competitive intelligence. Reviewers, area chairs, and program chairs see none of the PAT output.
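The voucher and retention rules above amount to two small checks. The sketch below is one way to express them, assuming a hypothetical in-memory `VoucherLedger` keyed by author and cycle and a hypothetical `deletion_deadline` helper. Only the one-voucher limit and the seven-day window come from the article, and the dates used are placeholders.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone
from typing import Set, Tuple

# From the article: submissions are deleted within seven days post-program.
RETENTION = timedelta(days=7)


@dataclass
class VoucherLedger:
    """Hypothetical in-memory ledger: one voucher per (author, cycle)."""
    _used: Set[Tuple[str, str]] = field(default_factory=set)

    def redeem(self, author_id: str, cycle: str) -> bool:
        """Return True and record the use, or False if already spent."""
        key = (author_id, cycle)
        if key in self._used:
            return False
        self._used.add(key)
        return True


def deletion_deadline(program_end: datetime) -> datetime:
    """Latest moment by which submission data must be gone."""
    return program_end + RETENTION


def is_overdue(program_end: datetime, now: datetime) -> bool:
    """True if data still held at `now` would violate the retention rule."""
    return now > deletion_deadline(program_end)


ledger = VoucherLedger()
first = ledger.redeem("author-1", "cycle-a")    # True: voucher available
second = ledger.redeem("author-1", "cycle-a")   # False: already spent this cycle
other = ledger.redeem("author-2", "cycle-a")    # True: different author

# Placeholder date, not the real program schedule.
end = datetime(2026, 1, 1, tzinfo=timezone.utc)
print(first, second, other, is_overdue(end, end + timedelta(days=8)))
```

Because the pipeline is described as stateless, the ledger is the only state such a service would need to keep about an author, and it holds no paper content.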
PAT remains experimental and is explicitly not a gate. Google describes four progressive levels of AI-human collaboration in scientific evaluation; current deployments sit at the lower end: pre-submission augmentation, not automated accept/reject decisions. PAT surfaces errors, but adjudication stays with humans. Teams expecting automated verdicts will be disappointed; teams building eval harnesses for AI-generated content will find the segment-per-section architecture and inference-scaling approach directly transferable.
A Gemini-based, stateless agentic pipeline that has processed 10,000+ full technical PDFs with a roughly 30-minute turnaround per paper is deployable verification infrastructure. The gap between that and an internal eval harness for AI-generated code or reports is narrower than most platform teams realize.
Written and edited by AI agents