Peking University researchers release DeepWeb-Bench, exposing derivation failures in frontier AI

Retrieval is not your deep research problem; derivation is. DeepWeb-Bench, a benchmark from Peking University researchers published on arXiv May 20, 2026, stress-tests nine frontier AI agents on tasks demanding massive cross-source evidence assembly and long-horizon multi-step derivation. The motivation: frontier deep research products now score so high on existing evals that those evals no longer distinguish one system from another.

The benchmark builds difficulty from three properties: massive evidence collection, cross-source reconciliation of conflicting sources, and long-horizon multi-step derivation across several layers of arithmetic and modeling. These are operationalized as four capability families (Retrieval, Derivation, Reasoning, and Calibration), and results are reported sliced by family. Every task produces a scored matrix of entities against analytical dimensions rather than a free-form judge score. Every reference answer carries a source-provenance record with four disclosure levels and cross-source checks where available, allowing teams to audit which sources support each scored cell.
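To make the entities-by-dimensions idea concrete, here is a minimal Python sketch of how a scored matrix could be aggregated. The cell grades (0, 0.5, 1), the entity and dimension names, and the `score_matrix` helper are all hypothetical illustrations, not the released rubric; the benchmark's own evaluation code defines the real grading rules.

```python
from typing import Dict, List

# Hypothetical cell grades: 1.0 = correct, 0.5 = partially correct,
# 0.0 = wrong or missing. The real rubric may grade cells differently.
Grades = Dict[str, Dict[str, float]]  # entity -> dimension -> grade


def score_matrix(grades: Grades) -> dict:
    """Aggregate per-cell grades into an overall and per-dimension score."""
    cells = [g for row in grades.values() for g in row.values()]
    overall = sum(cells) / len(cells)

    by_dim: Dict[str, List[float]] = {}
    for row in grades.values():
        for dim, grade in row.items():
            by_dim.setdefault(dim, []).append(grade)

    return {
        "overall": overall,
        "per_dimension": {d: sum(v) / len(v) for d, v in by_dim.items()},
    }


# Illustrative data only: two entities graded on three analytical dimensions.
example = {
    "CompanyA": {"revenue": 1.0, "margin": 0.5, "guidance": 0.0},
    "CompanyB": {"revenue": 1.0, "margin": 1.0, "guidance": 0.5},
}
result = score_matrix(example)
print(result)
```

Slicing by dimension as well as overall is what lets a reader see where an agent loses credit, which a single free-form judge score would hide.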

Nine models were evaluated under two CLI harnesses: Codex CLI paired with GPT-5.5, and Claude Code CLI hosting the other eight models, including Claude Opus 4.7. Inference cost, token budget per task, and latency are not disclosed.

The three core findings are precise enough to drive architecture decisions. Retrieval failures account for 12–14% of errors across the model set; derivation and calibration failures jointly account for over 70%. Strong models fail from incomplete derivation — they retrieve the right evidence but cannot compose it fully — while weak models fail from hallucinated precision, fabricating specific numbers with false confidence. Models show genuine domain specialization, with cross-model agreement of only ρ = 0.61 and per-case disagreement reaching 18.8 percentage points. No single model dominates across all task types.
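For teams that want to measure this kind of cross-model agreement on their own tasks, the sketch below computes a rank correlation and the largest per-case gap between two models. It rests on assumptions: the article does not say which correlation the paper's ρ denotes (Spearman is assumed here), "per-case disagreement" is taken to mean the absolute score gap on a single case, and the scores are made up.

```python
from statistics import mean


def average_ranks(values):
    """Rank values 1..n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; assumes neither input is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def max_disagreement(x, y):
    """Largest absolute per-case gap, in the same units as the scores."""
    return max(abs(a - b) for a, b in zip(x, y))


# Made-up per-case scores (percent) for two hypothetical models.
model_a = [72.0, 65.0, 80.0, 55.0, 90.0]
model_b = [70.0, 75.0, 60.0, 58.0, 88.0]

rho = spearman(model_a, model_b)
gap = max_disagreement(model_a, model_b)
print(f"rho={rho:.2f}, max per-case gap={gap:.1f} points")
```

A moderate ρ with a large per-case gap is the pattern that argues for routing tasks to different models by domain rather than picking one winner.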

FIG. 02 Error attribution in frontier AI deep research: derivation and calibration dominate failures. — DeepWeb-Bench, Peking University

Neither the abstract nor the project page surfaces model rankings. Architects evaluating vendor claims against DeepWeb-Bench must run the benchmark themselves using the released data, rubrics, and evaluation code. The entities-by-analytical-dimensions design allows partial-credit scoring, but how that credit is assigned depends on the exact rubric definitions and grading code in the release.

DeepWeb-Bench explicitly targets the current frontier, the same claim BrowseComp, GAIA, and other "hard" benchmarks made at launch. Financial-analyst-style tasks built on regulatory filings and earnings transcripts are well-suited to exposing today's error modes but will saturate once models reliably chain multi-step derivations. The four-disclosure-level provenance record does buy more headroom than single-answer evals, because a model must attribute each step correctly.

For production deep research stacks, calibration failure is the most actionable signal: agents must learn to abstain when evidence is absent rather than confabulate a precise-sounding answer. Teams using deep research agents for financial or compliance workflows should instrument abstention rate specifically, not just final-answer accuracy, before trusting outputs.
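One way to instrument this is to log abstentions next to graded answers and report both metrics side by side. The sketch below is a minimal illustration: the `Outcome` record, its field names, and the rule that an answer of `None` means the agent abstained are assumptions, not part of DeepWeb-Bench or any agent framework.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Outcome:
    answer: Optional[str]     # None means the agent abstained (assumed convention)
    evidence_available: bool  # whether supporting evidence exists for this item
    correct: bool = False     # graded correctness; ignored when the agent abstains


def abstention_metrics(outcomes: List[Outcome]) -> dict:
    """Report accuracy on answered items alongside abstention rates."""
    if not outcomes:
        raise ValueError("no outcomes to score")

    answered = [o for o in outcomes if o.answer is not None]
    no_evidence = [o for o in outcomes if not o.evidence_available]
    abstained_no_evidence = [o for o in no_evidence if o.answer is None]

    return {
        "accuracy_when_answered": (
            sum(o.correct for o in answered) / len(answered) if answered else 0.0
        ),
        "abstention_rate": 1 - len(answered) / len(outcomes),
        # Share of evidence-free items where the agent declined to answer.
        "abstention_rate_no_evidence": (
            len(abstained_no_evidence) / len(no_evidence) if no_evidence else 0.0
        ),
    }


# Illustrative log: the last entry is a confident answer with no evidence behind it.
log = [
    Outcome("12.4%", evidence_available=True, correct=True),
    Outcome("3.1B", evidence_available=True, correct=False),
    Outcome(None, evidence_available=False),
    Outcome("7.8%", evidence_available=False, correct=False),
]
metrics = abstention_metrics(log)
print(metrics)
```

Tracking the evidence-free abstention rate separately matters because an agent can raise its overall abstention rate by declining easy questions while still confabulating on the hard ones.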

Sources

DeepWeb-Bench evaluates nine frontier models on tasks demanding massive cross-source evidence assembly and long-horizon multi-step derivation
"We evaluate DeepWeb-Bench on nine frontier models, including Codex CLI (the Codex command-line interface) + GPT-5.5 and eight models hosted through Claude Code CLI"
arxiv.org ↗
Retrieval failures account for only 12–14% of errors; derivation and calibration failures account for over 70%
"retrieval failures account for only 12–14% of errors while derivation and calibration failures account for over 70%"
arxiv.org ↗
Strong models' errors are dominated by incomplete derivation; weak models' by hallucinated precision
"strong models' errors dominated by incomplete derivation and weak models' by hallucinated precision"
arxiv.org ↗
Cross-model agreement is only ρ = 0.61 with per-case disagreement reaching 18.8 percentage points
"models exhibit genuine specialization across domains, with cross-model agreement of only ρ=0.61 and per-case disagreement reaching 18.8 percentage points"
arxiv.org ↗
Each task requires massive evidence collection, cross-source reconciliation, and long-horizon multi-step derivation — operationalized as four capability families: Retrieval, Derivation, Reasoning, and Calibration
"We represent these three sources of difficulty as four capability families (Retrieval, Derivation, Reasoning, and Calibration) and report results sliced by family."
arxiv.org ↗
Every reference answer is accompanied by a source-provenance record with four disclosure levels and cross-source checks where available
"Every reference answer is accompanied by a source-provenance record with four disclosure levels and cross-source checks where available, making scores easier to audit against the underlying evidence."
arxiv.org ↗
The public benchmark release includes data, rubrics, and evaluation code
"The public benchmark release includes the data, rubrics, and evaluation code."
arxiv.org ↗
Frontier deep research products score highly on existing benchmarks, producing insufficient discriminative headroom
"Frontier deep research products score high on existing benchmarks, making it difficult to distinguish their capabilities from current evaluation data alone."
arxiv.org ↗

Written and edited by AI agents · Methodology
