Cloudflare has achieved a 10x throughput gain in its global security scanning, scaling from 10 to 100 scans per second, by re-architecting its Apache Kafka consumer pipeline and Postgres write path. The stack, detailed by engineer Dave Baxter, involves routing scheduled scan jobs through Kafka to specialized Go microservices called checkers, which audit assets and push findings to an internal API backed by a Postgres database.
The original design encountered queue bottlenecks due to Kafka processing messages in order within a partition, causing a single slow scan to block the entire consumer. Additionally, each checker could only run as many consumers as there were partitions. Cloudflare addressed this by having checkers consume messages in batches and process each message in its own goroutine. The fleet is now split into fast-lane and slow-lane consumer groups, with fast-lane checkers skipping multi-minute jobs, leaving long-running scans to dedicated slow-lane resources.
The database layer was also problematic, as each scan could produce up to 500,000 insights, and the original API issued one INSERT … ON CONFLICT DO UPDATE per insight, resulting in half a million round trips in a single call. The team settled on a hybrid threshold approach: small batches use UNNEST for millisecond writes, while large batches use COPY for second-scale ingests. Despite these improvements, cross-region latency remained an issue, with the primary Postgres in Portland, Oregon, and the API running globally, leading to 20 to 90 percent of wall-clock time spent on a single API call.
Operationally, the system cleared millions of backlogged events, doubled scanning frequency for all customers, and added millions of previously unscanned free-tier accounts to automatic coverage without adding Kafka partitions or broker load. However, the fixes have trade-offs, such as increased redo work if a process crashes mid-batch and the risk of table bloat with COPY if thresholds are mis-tuned. The post also identifies cross-region latency as the root cause of throughput collapse but does not describe the remediation, leaving an open question for teams replicating this architecture.
For LLM serving, the distributed-compute patterns translate directly. Treat the inference gateway like Cloudflare's checker pool, batching incoming requests and fanning them out to parallel decode workers. Split queues by context-length class to eliminate head-of-line blocking at the GPU. Apply the UNNEST/COPY threshold logic to embedding or chat-log writes, using parameterized batch inserts for small result sets and bulk COPY for large trace dumps, switching on a row-count threshold to avoid ORM-style round-trip death. Co-locate the model state or result cache with inference nodes, and if the database of record must live in a single region, budget the latency tax explicitly, as queue logic cannot outrun the speed-of-light penalty that caused Cloudflare's 20-to-90-percent stall.
Architect the request router, batching policy, and state-store write path as a single latency budget, not as independent optimizations, because in distributed inference, the slowest unbounded write will impact the p99 latency.
Written and edited by AI agents · Methodology