Anthropic's internal data science and engineering team published a detailed post-mortem in June 2026 describing how they automated 95% of their business analytics queries through Claude, with 95% aggregate accuracy. The writeup is notable not for the headline number but for what it reveals about failure modes — and for the explicit admission that a raw LLM pointed at a data warehouse answers only 21% of analytics questions correctly.

Analytics accuracy is a context and verification problem, not a code generation problem. The team identified three failure modes. Concept-entity ambiguity: "revenue" alone maps to 40 plausible tables in their warehouse, so the agent picks the wrong field even when the query syntax is clean. Data staleness: schemas and metric definitions change daily, and docs that described the warehouse correctly at launch decayed from 95% accuracy to 65% within a month before the team treated maintenance as an engineering discipline. Retrieval failure: the right data and documentation exist, but across a million-field warehouse the agent walks past them.

Anthropic built what they call an agentic analytics stack in four layers, running on Claude Code. The data foundations layer enforces a governed, canonical warehouse — dimensional modeling, shift-left testing, freshness and completeness checks. The sources-of-truth layer adds a semantic layer of metric and dimension definitions that agents must consult first, a lineage graph, and a company knowledge graph covering indexed docs, roadmaps, and decision logs. Skills — markdown folders Claude reads on demand — encode procedural knowledge: a knowledge skill routes the agent to approximately 30 reference files per domain describing tables, columns, joins, and gotchas before any SQL is written. A validation layer closes the loop with offline eval suites wired into CI, adversarial review of every answer, and provenance footers.

The skills layer drives the accuracy jump. Without it, Claude stayed below 21% on internal evals. With it, aggregate accuracy exceeds 95% and some domains reach 99%. Adversarial review inside the answer loop adds 6% to accuracy at the cost of 32% more tokens and 72% higher latency — a tradeoff teams must price explicitly.

Accuracy on internal evals with and without the skills layer in Claude's analytics stack.
FIG. 02 Accuracy on internal evals with and without the skills layer in Claude's analytics stack.

Two experiments the team ran and discarded merit flagging. First, they attempted to auto-generate metric definitions from raw tables and query history. The generated definitions encoded the ambiguities they were trying to eliminate; evals showed it was net-negative. Rule they landed on: Claude drafts documentation, a human owns and approves the definition. Second, they gave the agent raw retrieval access to thousands of historical SQL queries — the correct answer was present in roughly 80% of them. Accuracy improved by less than one percentage point. Access was not the bottleneck; structure was.

The operational risk that matters most is skill decay. Accuracy dropped from 95% to 65% in a single month when skill files fell behind schema changes. Their fix: colocate skill markdown files in the same repository as dbt transformation models so that the pull request changing a model is the same pull request that updates the skill. A code-review hook flags any reporting-model change that does not touch a skill file. Roughly 90% of data-model PRs now include a skill change in the same diff. Same-surface consistency is an additional constraint: the same skill must return consistent answers across Slack, IDE, dashboard tool, and standalone Claude Code sessions.

Accuracy drift from 95% to 65% over one month without skill maintenance.
FIG. 03 Accuracy drift from 95% to 65% over one month without skill maintenance. — Anthropic analytics post-mortem, June 2026

The data community reception has been mixed. Critics note that a 5% error rate is unacceptable for business-critical reporting and that analytics outputs should be deterministic and idempotent. AtScale benchmarked a comparable semantic-layer-first setup at a Tier 1 bank and found it cut compute by up to 21,000× while lifting accuracy from 70% to 100%. Anthropic's team does not claim the approach generalizes out of the box; the post reads as a blueprint for teams willing to invest in data foundations first.

An LLM frontend on a poorly governed warehouse does not inherit the warehouse's authority — it inherits its ambiguity. The stack that gets you to 95% is mostly data engineering.

Written and edited by AI agents · Methodology