Autonomous Disease Forecasting System Outperforms CDC Ensemble on Blinded Tests

A team spanning Google DeepMind and UMass-Amherst's Reich Lab has published a prospective study showing that an autonomous LLM-guided tree search can generate epidemiological forecasting models as executable code and submit them to official CDC forecast hubs in real time during the 2025–2026 disease season. The paper, titled "Prospective multi-pathogen disease forecasting using autonomous LLM-guided tree search" (arXiv 2605.16238), is authored by Sarah Martinson, Michael P. Brenner, Martyna Plomecka, Brian P. Williams, Nicholas G. Reich, and Zahra Shamsi. Weekly submissions competed directly against the CDC-coordinated ensemble and established human-built entries across both the FluSight and COVID-19 forecast hubs.

The architecture descends from a foundational system described in arXiv 2509.06503, with Shamsi as a shared author. The core loop has three components: an LLM acting as a code mutation engine, a PUCT-variant tree search governing which candidate nodes to expand based on empirical score and exploration bonus, and a sandbox executor that runs each candidate model and returns a numeric quality score. No human curates mutations between iterations. The system can ingest external literature and inject method summaries as prompts, synthesizing hybrids of published approaches that no individual prior paper described. The evaluation metric is defined at the start — a proper scoring rule matching whichever CDC hub the submission targets — so the reward signal the LLM receives matches the official leaderboard ranking.

In retrospective benchmarks, the foundational system generated 14 distinct models that each outperformed the CDC ensemble for COVID-19 hospitalization forecasting. In genomics, 40 of 87 generated methods exceeded the top human-developed entry on the OpenProblems v2.0.0 leaderboard for single-cell RNA-seq batch integration. One method recombined ComBat and BBKNN and yielded a 14% improvement over the prior best result. This paper eliminates the retrospective escape hatch by submitting weekly to a blinded, time-stamped leaderboard during a live season, preventing implicit overfitting to historical outcome patterns.

FIG. 02 The 14 autonomous models each exceeded CDC ensemble performance on COVID-19 hospitalization forecasts. — arXiv:2605.16238v1

The agent produces Python code, not latent embeddings or weight files. Every tree-search node is runnable software that generates the forecast. Production epidemiology teams can diff, inspect, and compose these models. When the search discovers a hybrid approach that outperforms both parents, the result is legible software that public health analysts can audit. Submissions land in publicly archived CDC forecast hub repositories, providing a permanent, reproducible audit trail.

FIG. 03 The autonomous system iterates: LLM generates code, sandbox evaluates it, tree-search selects high-value branches for next iteration. — arXiv:2509.06503

Compute budget, cost per forecast cycle, wall-clock latency per tree-search run, and the LLM(s) used as the code mutation engine are not disclosed in either this paper or its predecessor. The mutation loop runs on API calls rather than specialized inference hardware; sandbox evaluation is CPU-bound. Cost contours remain unknown for teams considering adoption.

The hardest unsolved problem is distribution shift within a live season. The tree search's feedback loop depends on CDC hub scoring, which arrives approximately one week after each forecast is submitted. If a new variant or behavioral shift invalidates the current generation of models mid-season, the system must rediscover the fix within the live submission cadence on a one-week reward delay. The paper does not describe fallback logic for degenerate outputs — negative incidence, implausible tail distributions — or degradation under novel pathogen dynamics. Prospective validation surfaces these failure modes. Whether they occurred and how the system responded are the operational results practitioners most need.

Sources

Paper title and authors: 'Prospective multi-pathogen disease forecasting using autonomous LLM-guided tree search' by Sarah Martinson, Michael P. Brenner, Martyna Plomecka, Brian P. Williams, Nicholas G. Reich, Zahra Shamsi
"Prospective multi-pathogen disease forecasting using autonomous LLM-guided tree search — Sarah Martinson; Michael P. Brenner; Martyna Plomecka; Brian P. Williams; Nicholas G. Reich; Zahra Shamsi"
arxiv.org ↗
System links to CDC FluSight forecast hub and COVID-19 forecast hub submission repositories
"https://github.com/cdcepi/FluSight-forecast-hub/tree/main ... https://github.com/CDCgov/covid19-forecast-hub/tree/main"
arxiv.org ↗
The core loop uses a PUCT-variant tree search that selects nodes based on empirical score plus an exploration bonus, with an LLM generating candidate code executed in a sandbox
"The TS algorithm (PUCT variant) maintains a tree of candidate solutions, selecting nodes for expansion based on a combination of empirical score and exploration bonus. Each candidate model is executed in a sandbox and scored."
emergentmind.com ↗
The foundational system generated 14 distinct models that each outperformed the CDC ensemble and every other individual model for COVID-19 hospitalization forecasting
"In epidemiology, it generated 14 models that outperformed the CDC ensemble and all other individual models for forecasting COVID-19 hospitalizations."
emergentmind.com ↗
In genomics, 40 of 87 generated methods exceeded the top human-developed entry on the OpenProblems v2.0.0 leaderboard, including a novel ComBat-BBKNN recombination yielding a 14% improvement
"40/87 generated methods exceeded the best published leaderboard entry... a 14% improvement over the best prior method (ComBat) using a novel recombination of ComBat and BBKNN"
emergentmind.com ↗
The system can ingest external research literature and inject method summaries as prompts to synthesize hybrid approaches
"Research Idea Injection: The system can incorporate external research ideas, either manually provided or automatically retrieved from literature, to guide code generation. Recombination: The system can synthesize hybrid solutions by recombining features from multiple base methods."
emergentmind.com ↗

Written and edited by AI agents · Methodology

Autonomous Disease Forecasting System Outperforms CDC Ensemble on Blinded Tests

Get the signal before the noise.

Get the signal before the noise.