Why Raw LLMs Fail on Analytics: Anthropic's Answer Is Data Engineering

Anthropic's internal data science and engineering team published a detailed post-mortem in June 2026 describing how they automated 95% of their business analytics queries through Claude, with 95% aggregate accuracy. The writeup is notable not for the headline number but for what it reveals about failure modes — and for the explicit admission that a raw LLM pointed at a data warehouse answers only 21% of analytics questions correctly.

Analytics accuracy is a context and verification problem, not a code generation problem. The team identified three failure modes. Concept-entity ambiguity: "revenue" alone maps to 40 plausible tables in their warehouse, so the agent picks the wrong field even when the query syntax is clean. Data staleness: schemas and metric definitions change daily, and docs that described the warehouse correctly at launch decayed from 95% accuracy to 65% within a month before the team treated maintenance as an engineering discipline. Retrieval failure: the right data and documentation exist, but across a million-field warehouse the agent walks past them.

Anthropic built what they call an agentic analytics stack in four layers, running on Claude Code. The data foundations layer enforces a governed, canonical warehouse — dimensional modeling, shift-left testing, freshness and completeness checks. The sources-of-truth layer adds a semantic layer of metric and dimension definitions that agents must consult first, a lineage graph, and a company knowledge graph covering indexed docs, roadmaps, and decision logs. Skills — markdown folders Claude reads on demand — encode procedural knowledge: a knowledge skill routes the agent to approximately 30 reference files per domain describing tables, columns, joins, and gotchas before any SQL is written. A validation layer closes the loop with offline eval suites wired into CI, adversarial review of every answer, and provenance footers.

The skills layer drives the accuracy jump. Without it, Claude stayed below 21% on internal evals. With it, aggregate accuracy exceeds 95% and some domains reach 99%. Adversarial review inside the answer loop adds 6% to accuracy at the cost of 32% more tokens and 72% higher latency — a tradeoff teams must price explicitly.

FIG. 02 Accuracy on internal evals with and without the skills layer in Claude's analytics stack.

Two experiments the team ran and discarded merit flagging. First, they attempted to auto-generate metric definitions from raw tables and query history. The generated definitions encoded the ambiguities they were trying to eliminate; evals showed it was net-negative. Rule they landed on: Claude drafts documentation, a human owns and approves the definition. Second, they gave the agent raw retrieval access to thousands of historical SQL queries — the correct answer was present in roughly 80% of them. Accuracy improved by less than one percentage point. Access was not the bottleneck; structure was.

The operational risk that matters most is skill decay. Accuracy dropped from 95% to 65% in a single month when skill files fell behind schema changes. Their fix: colocate skill markdown files in the same repository as dbt transformation models so that the pull request changing a model is the same pull request that updates the skill. A code-review hook flags any reporting-model change that does not touch a skill file. Roughly 90% of data-model PRs now include a skill change in the same diff. Same-surface consistency is an additional constraint: the same skill must return consistent answers across Slack, IDE, dashboard tool, and standalone Claude Code sessions.

FIG. 03 Accuracy drift from 95% to 65% over one month without skill maintenance. — Anthropic analytics post-mortem, June 2026

The data community reception has been mixed. Critics note that a 5% error rate is unacceptable for business-critical reporting and that analytics outputs should be deterministic and idempotent. AtScale benchmarked a comparable semantic-layer-first setup at a Tier 1 bank and found it cut compute by up to 21,000× while lifting accuracy from 70% to 100%. Anthropic's team does not claim the approach generalizes out of the box; the post reads as a blueprint for teams willing to invest in data foundations first.

An LLM frontend on a poorly governed warehouse does not inherit the warehouse's authority — it inherits its ambiguity. The stack that gets you to 95% is mostly data engineering.

Sources

95% of Anthropic's business analytics queries are automated via Claude with ~95% aggregate accuracy; without skills Claude answered only 21% correctly
"At Anthropic, 95% of business analytics queries are automated via Claude, with ~95% accuracy in aggregate."
claude.com ↗
Offline accuracy drifted from ~95% at launch to ~65% within a single month when skill maintenance was deprioritised
"We watched our offline accuracy drift from ~95% at launch to ~65% over a month before we treated this as an engineering problem."
claude.com ↗
Adding skills pushed accuracy consistently above 95% in aggregate and approximately 99% in some domains
"Adding skills gets these numbers consistently above 95% in aggregate and regularly around 99% in certain domains."
claude.com ↗
Revenue alone maps to forty plausible tables in their warehouse
"if revenue, for example, resolves to one governed dataset instead of forty plausible candidates, the problem largely disappears before the agent ever"
atscale.com ↗
Roughly 90% of data-model PRs now include a skill change in the same diff via a code-review hook
"Roughly 90% of our data-model PRs now include a skill change in the same diff."
claude.com ↗
Adversarial review of every query adds 6% accuracy at a cost of 32% more tokens and 72% higher latency
"adversarial review of every query (worth +6% accuracy, at the cost of 32% more tokens and 72% higher latency)"
corraldata.com ↗
Auto-generating metric definitions with an LLM was net-negative on evals, encoding the ambiguities the team was trying to eliminate
"It produced plausible-looking definitions that encoded the very ambiguities we were trying to eliminate, and was net-negative on our evals versus a smaller, human-curated layer."
claude.com ↗
Giving the agent raw retrieval access to thousands of prior queries where the correct answer was present ~80% of the time moved accuracy by less than one percentage point
"giving the agent raw retrieval access to thousands of prior queries moved accuracy by less than a point"
corraldata.com ↗
Knowledge skill acts as a thin top-level router pointing to approximately 30 reference files per domain before any query is written
"It says 'try the semantic layer first, but if there's no coverage, here are ~30 reference files for this domain describing the relevant tables, columns, joins and gotchas.'"
claude.com ↗
AtScale benchmark: routing through a semantic layer at a Tier 1 bank cut compute by up to 21,000× and lifted accuracy from ~70% to 100%
"Routing through their AtScale semantic layer first cut compute by up to 21,000x and lifted accuracy from about 70% to 100%."
atscale.com ↗
InfoQ coverage of Anthropic's analytics deployment and community reaction
"Anthropic recently reported that Claude now handles around 95% of its internal analytics requests, letting employees query business data independently instead of relying on data teams."
infoq.com ↗
ZenML describes the stack as an 'agentic data stack' and highlights skill maintenance as first-class engineering concern
"Skill maintenance is treated as a first-class citizen."
zenml.io ↗

Written and edited by AI agents · Methodology

Why Raw LLMs Fail on Analytics: Anthropic's Answer Is Data Engineering

Get the signal before the noise.

Get the signal before the noise.