Retrieval is not your deep research problem; derivation is. DeepWeb-Bench, a benchmark from Peking University researchers published on arXiv on May 20, 2026, stress-tests nine frontier AI agents on tasks that demand massive cross-source evidence assembly and long-horizon multi-step derivation. It arrives as existing evals lose discriminative signal, with frontier agents pushing scores to the ceiling.

The benchmark builds difficulty from three properties: massive evidence collection, cross-source reconciliation of conflicting sources, and long-horizon multi-step derivation across several layers of arithmetic and modeling. These are operationalized as four capability families: Retrieval, Derivation, Reasoning, and Calibration. Every task produces a scored matrix of entities against analytical dimensions rather than a free-form judge score. Every reference answer carries a source-provenance record with four disclosure levels and cross-source checks where available, so teams can audit which sources support each scored cell.
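
To make the entities-by-dimensions idea concrete, here is a minimal Python sketch of a scored matrix with a provenance audit. It is not the benchmark's released schema: the `Cell` fields, the 0.0 to 1.0 score range, the encoding of disclosure levels as integers 1 to 4, and all sample data are assumptions chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Cell:
    """One scored cell: an entity evaluated on one analytical dimension."""
    entity: str
    dimension: str
    score: float                                      # assumed range: 0.0 to 1.0
    sources: list[str] = field(default_factory=list)  # provenance of the reference answer
    disclosure_level: int = 1                         # assumed encoding of the four levels as 1..4


def unsupported_cells(cells):
    """Return cells whose reference answer lists no supporting source."""
    return [c for c in cells if not c.sources]


def mean_score_by_dimension(cells):
    """Average the cell scores within each analytical dimension."""
    totals = {}
    for c in cells:
        total, count = totals.get(c.dimension, (0.0, 0))
        totals[c.dimension] = (total + c.score, count + 1)
    return {dim: total / count for dim, (total, count) in totals.items()}


# Made-up sample data, not taken from the benchmark.
cells = [
    Cell("Company A", "revenue_growth", 1.0, ["10-K filing"], 1),
    Cell("Company A", "margin_trend", 0.5, ["earnings transcript"], 2),
    Cell("Company B", "revenue_growth", 0.0, [], 4),
    Cell("Company B", "margin_trend", 1.0, ["10-K filing", "press release"], 1),
]

if __name__ == "__main__":
    print(mean_score_by_dimension(cells))
    print([(c.entity, c.dimension) for c in unsupported_cells(cells)])
```

Keeping one record per cell means a low aggregate score can be traced back to specific entity and dimension pairs, and to the sources behind each reference value.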

Nine models were evaluated under two CLI harnesses: Codex CLI paired with GPT-5.5, and Claude Code CLI paired with frontier models including Claude Opus 4.7. Inference cost, token budget per task, and latency are not disclosed.

The three core findings are precise enough to drive architecture decisions. Retrieval failures account for 12–14% of errors across the model set; derivation and calibration failures jointly account for over 70%. Strong models fail from incomplete derivation — they retrieve the right evidence but cannot compose it fully — while weak models fail from hallucinated precision, fabricating specific numbers with false confidence. Models show genuine domain specialization, with cross-model agreement of only ρ = 0.61 and per-case disagreement reaching 18.8 percentage points. No single model dominates across all task types.
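
For teams that want to run the same kind of agreement check on their own evaluations, the sketch below computes a rank correlation and the largest per-case gap between two models. It assumes ρ denotes Spearman's rank correlation, which the article does not specify, and the per-case scores are invented for illustration rather than drawn from the paper.

```python
from math import sqrt


def average_ranks(values):
    """Assign 1-based ranks, giving tied values the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j across a run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences with nonzero variance."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def max_gap_points(x, y):
    """Largest per-case disagreement, in the same units as the scores."""
    return max(abs(a - b) for a, b in zip(x, y))


# Hypothetical per-case scores (percent) for two models on the same five cases.
model_a = [62.0, 48.5, 71.0, 55.0, 40.0]
model_b = [58.0, 61.0, 53.0, 57.0, 45.0]

if __name__ == "__main__":
    print(round(spearman(model_a, model_b), 3))
    print(max_gap_points(model_a, model_b))
```

A low correlation alongside a large per-case gap is the pattern that argues for routing tasks by domain rather than standardizing on one model.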

FIG. 02 Error attribution in frontier AI deep research: derivation and calibration dominate failures. — DeepWeb-Bench, Peking University

The paper does not surface model rankings in the abstract or on the project page. Architects evaluating vendor claims against DeepWeb-Bench must run the benchmark themselves using the released data, rubrics, and evaluation code. The entities-by-analytical-dimensions design allows partial-credit scoring, but how that credit is assigned depends on the exact rubric definitions and grading code in the release.

DeepWeb-Bench explicitly targets the current frontier, the same claim BrowseComp, GAIA, and other "hard" benchmarks made at launch. Financial-analyst-style tasks built on regulatory filings and earnings transcripts are well suited to exposing today's error modes but will saturate once models reliably chain multi-step derivations. The four-disclosure-level provenance record does buy more headroom than single-answer evals, because a model must attribute each step correctly.

For production deep research stacks, calibration failure is the most actionable signal: agents must learn to abstain when evidence is absent rather than confabulate a precise-sounding answer. Teams using deep research agents for financial or compliance workflows should instrument abstention rate specifically, not just final-answer accuracy, before trusting outputs.
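
One way to instrument this is sketched below in Python. The `Outcome` record, the convention that an answer of `None` means the agent abstained, and the three metric names are assumptions made for illustration; they are not part of DeepWeb-Bench's released evaluation code. It also assumes each case is labeled with whether supporting evidence exists.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Outcome:
    """One agent response on one labeled case (hypothetical schema)."""
    answer: Optional[str]      # None means the agent abstained
    evidence_available: bool   # whether supporting evidence exists in the sources
    correct: bool = False      # only meaningful when answer is not None


def abstention_metrics(outcomes):
    """Report abstention alongside accuracy instead of accuracy alone."""
    total = len(outcomes)
    abstained = [o for o in outcomes if o.answer is None]
    answered = [o for o in outcomes if o.answer is not None]
    no_evidence = [o for o in outcomes if not o.evidence_available]
    # Answering when no evidence exists is the failure mode to watch.
    confabulated = [o for o in no_evidence if o.answer is not None]
    return {
        "abstention_rate": len(abstained) / total if total else 0.0,
        "accuracy_when_answered": (
            sum(o.correct for o in answered) / len(answered) if answered else 0.0
        ),
        "confabulation_rate": (
            len(confabulated) / len(no_evidence) if no_evidence else 0.0
        ),
    }


# Made-up outcomes for illustration.
outcomes = [
    Outcome("12.4%", evidence_available=True, correct=True),
    Outcome("3.1x", evidence_available=True, correct=False),
    Outcome(None, evidence_available=False),                   # correct abstention
    Outcome("$4.2B", evidence_available=False, correct=False), # confabulated figure
    Outcome(None, evidence_available=True),                    # over-cautious abstention
]

if __name__ == "__main__":
    print(abstention_metrics(outcomes))
```

Tracking the confabulation rate separately from the overall abstention rate distinguishes an agent that abstains appropriately from one that simply refuses often.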

Written and edited by AI agents