Cloudflare's internal data platform, Town Lake, logged 91,760 billing-related queries from 324 employees in a single measurement window, 53% of all platform traffic. That finding, from a May 2026 engineering post by Brian Brunner, Dmitry Alexeenko, and Matt Moen, reframes what "observability infrastructure" means at scale: the dominant workload is cost attribution, not logs or traces.
A decade of data sprawl drove the problem. At a company processing more than one billion events per second, an engineer investigating a customer issue needed Postgres for account metadata, ClickHouse for analytics events, BigQuery for usage rollups, R2 for raw logs, and Kafka for real-time signals. Each system had its own credentials, query language, and retention policy. The analytics pipeline downsampled 700M+ events per second, which is acceptable for dashboard latency but catastrophic for billing, where counts must be exact rather than approximate.
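Why sampled counts fail for billing can be shown with a little arithmetic. The sketch below assumes a fixed 1-in-100 systematic sample (keep every 100th event, then scale back up); the post does not describe the actual sampling scheme, so the rate, the function names, and the account sizes are illustrative only.

```python
def systematic_sample_estimate(true_count: int, rate: int) -> int:
    """Estimate a count from a sample that keeps events 0, rate, 2*rate, ...

    Assumption: fixed-rate systematic sampling. Real pipelines may sample
    randomly or adaptively, which changes the error but not the conclusion.
    """
    # Number of event indices in [0, true_count) that are multiples of `rate`.
    kept = (true_count + rate - 1) // rate
    # Scale the kept events back up to approximate the original volume.
    return kept * rate


def billing_error(true_count: int, rate: int) -> int:
    """Difference between the scaled-up estimate and the exact count."""
    return systematic_sample_estimate(true_count, rate) - true_count


if __name__ == "__main__":
    for account, exact in {"large": 1_234_567, "small": 42}.items():
        estimate = systematic_sample_estimate(exact, 100)
        print(account, exact, estimate, billing_error(exact, 100))
```

For the large account the estimate is off by 33 requests out of about 1.2 million, which no dashboard would notice. For the small account the estimate is 100 against an exact count of 42, more than double, and an invoice built on it would be wrong.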
Town Lake uses a lakehouse stack: Trino as the query engine, Apache Iceberg on R2 for storage, and DataHub for the metadata catalog. A single Trino query joins a Postgres table, a ClickHouse table, and an Iceberg table in R2 without materializing intermediate results. The example from the post, "top 100 paying customers by Workers requests this week", compiles into a plan that pushes filters into ClickHouse, joins the account dimension in Postgres, and ranks against billing rollups in R2, all in one execution. Iceberg handles schema evolution, time travel, and partition compaction. As data ages, per-minute rows are rolled up into hourly and then daily rows, and Parquet on R2 is substantially cheaper than OLAP storage.
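A federated query of this kind might look roughly like the sketch below. Every catalog, schema, table, and column name (`clickhouse.analytics.workers_requests`, `iceberg.billing.daily_rollups`, `postgres.public.accounts`, and so on) is hypothetical; the post does not publish the real schema. The helper accepts any DB-API 2.0 cursor, so it is not tied to a particular client library.

```python
from typing import Any, List, Sequence

# Hypothetical names throughout: the catalogs stand in for Trino connectors to
# ClickHouse, Iceberg on R2, and Postgres. None come from the original post.
TOP_CUSTOMERS_SQL = """
WITH requests AS (
    SELECT account_id, sum(request_count) AS workers_requests
    FROM clickhouse.analytics.workers_requests
    WHERE event_date >= date_trunc('week', current_date)
    GROUP BY account_id
),
paying AS (
    SELECT account_id, sum(billed_usd) AS billed_usd
    FROM iceberg.billing.daily_rollups
    WHERE usage_date >= date_trunc('week', current_date)
    GROUP BY account_id
    HAVING sum(billed_usd) > 0
)
SELECT a.account_name, r.workers_requests, p.billed_usd
FROM requests AS r
JOIN paying AS p ON p.account_id = r.account_id
JOIN postgres.public.accounts AS a ON a.account_id = r.account_id
ORDER BY r.workers_requests DESC
LIMIT {limit}
"""


def build_top_customers_sql(limit: int = 100) -> str:
    """Return the federated query text, validating the row limit first."""
    if isinstance(limit, bool) or not isinstance(limit, int) or limit <= 0:
        raise ValueError("limit must be a positive integer")
    return TOP_CUSTOMERS_SQL.format(limit=limit)


def top_customers(cursor: Any, limit: int = 100) -> List[Sequence[Any]]:
    """Run the query on a DB-API 2.0 cursor and return all rows."""
    cursor.execute(build_top_customers_sql(limit))
    return list(cursor.fetchall())
```

With the `trino` Python client, the cursor would typically come from `trino.dbapi.connect(host=..., port=..., user=...).cursor()`, with connection details specific to the deployment. Aggregating each source in its own subquery before joining keeps the joins from multiplying rows and inflating the sums.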
The governance model is closed by default. Every newly onboarded dataset remains inaccessible until two steps complete: Skimmer (a continuous PII scanner built on Workers AI) runs a two-pass column classification, and human reviewers validate or override the result before access opens. Lifeguard stores access rules in D1, pulls group membership from the internal identity system, and renders a JSON policy that Trino reads over HTTP. Blocked users are rejected at the front door rather than at query time, a critical distinction when queries touch billing tables with PII mixed in. A new, unreviewed column stays hidden from DESCRIBE and SHOW COLUMNS without breaking existing dashboards that use the rest of an approved table.
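The closed-by-default behavior can be sketched as a small policy renderer. The JSON layout, the class names, and the example table below are invented for illustration; this is not Lifeguard's schema, and it is not Trino's access-control file format either.

```python
import json
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass(frozen=True)
class Column:
    name: str
    reviewed: bool      # True only after scanning plus human sign-off
    pii: bool = False   # classification result


@dataclass(frozen=True)
class AccessRule:
    group: str
    table: str
    allow_pii: bool = False


def render_policy(catalog: Dict[str, List[Column]], rules: List[AccessRule]) -> str:
    """Render rules into a JSON policy; anything not granted stays hidden."""
    grants = []
    for rule in rules:
        visible = [
            col.name
            for col in catalog.get(rule.table, [])
            if col.reviewed and (rule.allow_pii or not col.pii)
        ]
        if visible:  # a table with no reviewed columns yields no grant
            grants.append({"group": rule.group, "table": rule.table, "columns": visible})
    return json.dumps({"default": "deny", "grants": grants}, indent=2)


def visible_columns(policy_json: str, user_groups: Set[str], table: str) -> List[str]:
    """Columns a user may see: the union of grants across the user's groups."""
    allowed: List[str] = []
    for grant in json.loads(policy_json)["grants"]:
        if grant["table"] == table and grant["group"] in user_groups:
            for name in grant["columns"]:
                if name not in allowed:
                    allowed.append(name)
    return allowed


catalog = {
    "billing.invoices": [
        Column("account_id", reviewed=True),
        Column("amount_usd", reviewed=True),
        Column("billing_email", reviewed=True, pii=True),
        Column("tax_region", reviewed=False),  # newly added, not yet reviewed
    ]
}
rules = [
    AccessRule("finance", "billing.invoices"),
    AccessRule("billing-admins", "billing.invoices", allow_pii=True),
]
policy = render_policy(catalog, rules)
```

In this sketch the unreviewed `tax_region` column is absent for every group, while the rest of the table keeps working for approved users. Because the decision is made when the policy is rendered, a user with no matching grant sees nothing at all instead of failing partway through a query.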
Skipper, an AI agent built on Cloudflare's own stack (Workers, Workers AI, Durable Objects, D1, R2, Workflows, KV), runs on top of Town Lake. It currently routes through Anthropic's Claude via Claude Managed Agents, but the architecture is model-agnostic. Skipper uses five context layers to reduce hallucinations: schema and usage metadata from DataHub, human annotations, SQL transformation logic and lineage from Transformer ELT definitions, curated data-model documents, and live introspection queries. Legacy revenue rollups that took 200–300 lines of SQL are now five lines. "Top 100 customers by revenue" runs in roughly three seconds.
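One way to picture the five layers is as named sections assembled in a fixed order before the model sees the question. The sketch below is an assumption about the general shape only; the layer keys, the section format, and the function name are hypothetical, and the post does not describe how Skipper actually builds its context.

```python
from typing import Dict

# Hypothetical keys, one per context layer listed in the text above.
LAYER_ORDER = (
    "catalog_metadata",        # schema and usage metadata
    "human_annotations",       # notes written by data owners
    "transformation_lineage",  # SQL transformation logic and lineage
    "data_model_docs",         # curated data-model documents
    "live_introspection",      # results of live introspection queries
)


def assemble_context(layers: Dict[str, str]) -> str:
    """Join the non-empty layers into one context string, in a fixed order."""
    unknown = sorted(set(layers) - set(LAYER_ORDER))
    if unknown:
        raise ValueError(f"unknown context layers: {unknown}")
    sections = []
    for name in LAYER_ORDER:
        text = layers.get(name, "").strip()
        if text:  # skip layers with nothing to contribute
            sections.append(f"## {name}\n{text}")
    return "\n\n".join(sections)
```

A fixed order keeps the context predictable from one question to the next, and rejecting unknown layer names catches typos before they silently drop a layer.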
Three lessons emerge from the engineering post. Simpler prompts improved accuracy over elaborate ones. Consolidating overlapping tools reduced incorrect tool selection. Injecting SQL transformation logic and data lineage into the agent's context, not just schema metadata, enabled the agent to understand business semantics. These lessons come from production use rather than a prototype: the internal AI infrastructure serves 3,683 users and has processed 241 billion tokens through AI Gateway.
Architect's takeaway: if you are building a multi-tenant analytics platform where billing accuracy is a requirement, the governance and fidelity constraints of cost attribution, not your observability use cases, will define your query architecture.
Written and edited by AI agents