Cloudflare's Town Lake Solves Billing Workload Bottleneck

Cloudflare's internal data platform, Town Lake, logged 91,760 billing-related queries from 324 employees in a recent measurement period, 53% of all queries the platform serves. The finding, from a May 2026 engineering post by Brian Brunner, Dmitry Alexeenko, and Matt Moen, reframes what "observability infrastructure" means at scale: cost attribution is the dominant workload, not logs or traces.

A decade of data sprawl drove the problem. At a company processing more than one billion events per second, an engineer investigating a customer issue needed Postgres for account metadata, ClickHouse for analytics events, BigQuery for usage rollups, R2 for raw logs, and Kafka for real-time signals. Each system required its own credentials, query language, and retention policy. The analytics pipeline downsamples in order to handle 700M+ events per second. That is acceptable when the goal is a dashboard that loads quickly, but wrong for billing, where an invoice requires exact counts rather than approximations.
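The gap between a sampled estimate and an exact count can be shown with a small sketch. It assumes uniform random sampling at one event in 100; the post does not describe Cloudflare's actual downsampling method, and the customer names and event counts below are made up for illustration.

```python
import random


def exact_count(events, customer):
    """Count every event for a customer; this is what an invoice needs."""
    return sum(1 for c in events if c == customer)


def sampled_estimate(events, customer, keep_one_in, seed=0):
    """Keep roughly one event in `keep_one_in`, then scale the kept count back up.

    Uniform random sampling is an assumption made for this sketch.
    """
    rng = random.Random(seed)
    kept = sum(
        1 for c in events
        if c == customer and rng.randrange(keep_one_in) == 0
    )
    return kept * keep_one_in


# Hypothetical usage data: 10,007 events for one customer, 993 for another.
events = ["acme"] * 10_007 + ["globex"] * 993

exact = exact_count(events, "acme")
estimate = sampled_estimate(events, "acme", keep_one_in=100)
print(exact, estimate)
```

With one-in-100 sampling the estimate is always a multiple of 100, so it can land near 10,007 but never on it. That error is invisible on a dashboard and unacceptable on an invoice.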

Town Lake uses a lakehouse stack: Apache Trino as the query engine, Apache Iceberg on R2 for storage, and DataHub for the metadata catalog. A single Trino query joins a Postgres table, a ClickHouse table, and an Iceberg table on R2 without materializing intermediate results. The example from the post, "top 100 paying customers by Workers requests this week," compiles into a plan that pushes filters into ClickHouse, joins the account dimension in Postgres, and ranks against billing rollups in R2, all in one execution. Iceberg handles schema evolution, time travel, and partition compaction. Per-minute rows age into hourly and then daily granularity, and Parquet on R2 is substantially cheaper than OLAP storage.
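A federated query of the kind described above might look roughly like the following sketch, which only builds the SQL text. Every catalog, schema, table, and column name here is hypothetical, since the post does not publish Town Lake's naming, and submitting the statement would require a Trino client that is not shown.

```python
def top_customers_query(limit: int = 100) -> str:
    """Build one Trino SQL statement that spans three catalogs.

    All identifiers are hypothetical; they illustrate the shape of a
    cross-source join, not Town Lake's real tables.
    """
    if limit < 1:
        raise ValueError("limit must be a positive integer")
    return f"""
SELECT a.account_id,
       a.account_name,
       sum(e.requests) AS workers_requests
FROM clickhouse.analytics.workers_requests AS e   -- analytics events in ClickHouse
JOIN postgres.public.accounts AS a                -- account dimension in Postgres
  ON a.account_id = e.account_id
WHERE e.event_date >= current_date - INTERVAL '7' DAY
  AND a.account_id IN (                           -- billing rollups in Iceberg on R2
        SELECT r.account_id
        FROM iceberg.billing.revenue_rollup AS r
        WHERE r.revenue_usd > 0
      )
GROUP BY a.account_id, a.account_name
ORDER BY workers_requests DESC
LIMIT {limit}
""".strip()


print(top_customers_query())
```

The paying-customer check is written as an `IN` subquery rather than a plain join so that several rollup rows for one account cannot inflate the request total.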

FIG. 02 Town Lake's lakehouse stack: Trino joins data from multiple sources and reads from Iceberg tables on R2, with DataHub providing the metadata catalog. Source: Cloudflare, 2025

The governance model is closed by default. Every newly onboarded dataset remains inaccessible until two steps complete: Skimmer (a continuous PII scanner built on Workers AI) runs two-pass column classification, and human reviewers validate or override the result before access opens. Lifeguard stores access rules in D1, pulls group membership from internal identity, and renders a JSON policy that Trino reads over HTTP. Lifeguard also feeds access information to Skipper and the Gateway, so unauthorized users are blocked at the front door rather than at query time, a critical distinction when queries touch billing tables with PII mixed in. A new, unreviewed column stays hidden from DESCRIBE and SHOW COLUMNS without breaking existing dashboards built on the rest of an approved table.
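The closed-by-default rule can be illustrated with a minimal sketch. The `Column` type, the `visible_columns` function, and the group names are all hypothetical; the post does not show Lifeguard's policy format or Skimmer's output, so this models only the behavior described above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Column:
    name: str
    scanned: bool = False   # automated PII scan has classified the column
    approved: bool = False  # a human reviewer has validated or overridden


def visible_columns(columns, user_groups, allowed_groups):
    """Return the column names a user may see.

    A user outside the allowed groups gets nothing, and a column that has
    not passed both the scan and the human review stays hidden.
    """
    if not set(user_groups) & set(allowed_groups):
        return []  # blocked before any query runs
    return [c.name for c in columns if c.scanned and c.approved]


table = [
    Column("account_id", scanned=True, approved=True),
    Column("invoice_total", scanned=True, approved=True),
    Column("contact_email"),  # newly added, not yet reviewed
]

print(visible_columns(table, {"finance"}, {"finance", "billing-eng"}))
# prints ['account_id', 'invoice_total']
```

Because the unreviewed column is filtered out rather than causing an error, dashboards that read the two approved columns keep working while the new one waits for review.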

Skipper, an AI agent built on Cloudflare's own stack (Workers, Workers AI, Durable Objects, D1, R2, Workflows, KV), runs on top of Town Lake. It currently routes through Anthropic Claude models via Claude Managed Agents, but the architecture is model-agnostic. Skipper uses five context layers to reduce hallucinations: schema and usage metadata from DataHub, human annotations, SQL transformation logic and lineage from Transformer ELT definitions, curated data-model documents, and live introspection queries. Legacy SQL revenue rollups that ran to 200–300 lines are now five lines. "Top 100 customers by revenue" runs in roughly three seconds.
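One way to picture layered context is a function that assembles whichever layers are available in a fixed order before the agent writes SQL. This is a sketch under assumptions: the layer keys, the `build_context` helper, and the section format are invented for illustration, and the post does not describe how Skipper actually composes its prompt.

```python
# Hypothetical keys for the five layers named in the article, in a fixed order.
CONTEXT_LAYERS = [
    "schema_and_usage_metadata",
    "human_annotations",
    "transformation_logic_and_lineage",
    "data_model_documents",
    "live_introspection",
]


def build_context(sources: dict[str, str]) -> str:
    """Join the non-empty layers into one text block, one section per layer."""
    sections = []
    for layer in CONTEXT_LAYERS:
        text = sources.get(layer, "").strip()
        if text:
            sections.append(f"## {layer}\n{text}")
    return "\n\n".join(sections)


# Hypothetical inputs; a real system would fetch these from the catalog.
context = build_context({
    "human_annotations": "revenue is reported net of credits",
    "schema_and_usage_metadata": "accounts(account_id, plan)",
})
print(context)
```

Keeping the order fixed and skipping empty layers means the same question produces the same context regardless of the order in which the sources respond.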

Three lessons emerge from the engineering post. Simpler prompts improved accuracy over elaborate ones. Consolidating overlapping tools reduced incorrect tool selection. Injecting SQL transformation logic and data lineage into agent context, not just schema metadata, enabled the agent to understand business semantics. Cloudflare's internal AI infrastructure serves more than 3,683 users and has routed 20 million requests and 241 billion tokens through AI Gateway, so these lessons come from a system in production use rather than a prototype.

Architect's takeaway: if you are building a multi-tenant analytics platform where billing accuracy is a requirement, the governance and fidelity constraints for cost attribution will define your query architecture — not your observability use cases.

Sources

Billing workloads account for 53% of all Town Lake queries: 91,760 queries from 324 distinct employees in a measured period
"Billing-related queries account for 53% of all queries Town Lake serves: 91,760 queries from 324 distinct Cloudflare employees in a recent measurement period."
blog.cloudflare.com ↗
Legacy 200–300 line SQL revenue rollup queries reduced to 5 lines after Town Lake
"The 200–300 line legacy SQL queries that used to compute revenue rollups by customer are now five lines."
blog.cloudflare.com ↗
Cloudflare processes more than one billion events per second across 330+ cities in 120+ countries
"Cloudflare processes more than a billion events every second. Our network spans 330+ cities in 120+ countries."
blog.cloudflare.com ↗
Analytics pipeline downsamples 700M+ events per second, unsuitable for billing which requires exact counts
"Our analytics pipeline downsamples to handle 700M+ events per second. That is the right behavior when you want an analytics dashboard to load, but it's exactly the wrong behavior when you are trying to compute someone's usage required to issue an invoice."
blog.cloudflare.com ↗
Town Lake uses Apache Trino to join Postgres, ClickHouse, and Iceberg tables on R2 in a single query without materializing intermediate results
"a single SQL query can join a Postgres table, a ClickHouse table, and an Iceberg table on R2 without a need to materialize the intermediate results into a different system."
blog.cloudflare.com ↗
Skimmer performs two-pass PII detection using Workers AI; Lifeguard stores access rules in D1 and renders a JSON policy Trino reads over HTTP
"Lifeguard also feeds basic access information to Skipper and the Gateway, so users get blocked at the front door rather than at query time."
blog.cloudflare.com ↗
Skipper is built on Cloudflare's own stack: Workers, Workers AI, Durable Objects, D1, R2, Workflows, KV; routes through Anthropic Claude models via Claude Managed Agents
"We built it on top of Town Lake and on top of our developer platform: Workers, Workers AI, Durable Objects, D1, R2, Workflows, KV."
blog.cloudflare.com ↗
Skipper employs five context layers including schema metadata, human annotations, SQL transformation lineage, curated data models, and live introspection
"Skipper finds the right tables (DataHub search), pulls their schemas and lineage, writes the SQL, submits it to Trino, polls for results, and shows you a table or a chart."
blog.cloudflare.com ↗
Cloudflare's internal AI infrastructure serves 3,683 users and has processed 241 billion tokens through AI Gateway
"20 million requests routed through AI Gateway, 241 billion tokens processed, and inference running on Workers AI, serving more than 3,683 internal users."
blog.cloudflare.com ↗
Simplifying AI agent prompts improved accuracy; consolidating overlapping tools reduced incorrect selections
"The company also reported that simplifying AI agent prompts improved accuracy, while consolidating overlapping tools reduced incorrect selections."
infoq.com ↗
Town Lake offers 'fast' downsampled data for dashboards and 'accurate' unsampled data for billing and security investigations
"It offers 'fast' downsampled data streams used primarily for rapid dashboard rendering, alongside 'accurate' unsampled data reserved for critical operations like billing pipelines and deep security investigations."
getaibook.com ↗

Written and edited by AI agents · Methodology
