Cloudflare has deployed a purpose-built AI inference stack to run frontier-scale large language models across its global network, positioning its edge infrastructure as a production-grade alternative to hyperscaler GPU clouds. The core innovation is disaggregated prefill: splitting an LLM request's two processing stages across separate physical machines. Prefill (processing input tokens and populating the key-value cache) is compute-bound and runs on one hardware class; decode (generating output tokens) is memory-bound and runs on another.
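The split described above can be sketched in a few lines. This is a conceptual illustration only; Cloudflare has not published Infire's internal API, so every name below is an assumption, and the "model" is a toy stand-in:

```python
from dataclasses import dataclass

@dataclass
class PrefillResult:
    kv_cache: list   # per-layer key/value tensors (toy placeholder)
    next_token: int

def prefill(prompt_tokens: list) -> PrefillResult:
    # Compute-bound stage: one forward pass over ALL input tokens,
    # populating the KV cache. Maps to compute-optimized hardware.
    kv_cache = [("kv", t) for t in prompt_tokens]
    return PrefillResult(kv_cache=kv_cache, next_token=0)

def decode(result: PrefillResult, max_new_tokens: int) -> list:
    # Memory-bound stage: one token per step, each step re-reading the
    # growing KV cache. Maps to memory-bandwidth-optimized hardware.
    out = []
    token = result.next_token
    for _ in range(max_new_tokens):
        result.kv_cache.append(("kv", token))  # cache grows per token
        token += 1                             # toy "model" step
        out.append(token)
    return out

def serve(prompt_tokens: list, max_new_tokens: int = 4) -> list:
    # In a disaggregated deployment the KV cache would be transferred
    # between machines at this boundary; here it stays in memory.
    return decode(prefill(prompt_tokens), max_new_tokens)
```

The handoff point in `serve` is the crux of the design: the prefill pool's output (the populated KV cache) is the decode pool's input, so the two pools can be provisioned and scaled independently.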

Matching workload characteristics to hardware reduces wasted capacity and improves both latency and throughput per dollar. Models like Kimi K2.5 exceed one trillion parameters and weigh roughly 560 GB, requiring eight H100 GPUs just to load weights. Cloudflare's disaggregation lets large models run on fewer or less expensive GPUs. Llama 4 Scout runs on two H200 GPUs with substantial remaining headroom. Kimi K2.5 runs on eight H100s while retaining KV cache space.
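The headroom claim for Kimi K2.5 can be sanity-checked with back-of-envelope arithmetic. The 80 GB HBM capacity per H100 is a public spec; the 560 GB weight footprint is the figure quoted above:

```python
# Back-of-envelope memory check for the Kimi K2.5 deployment quoted above.
H100_HBM_GB = 80        # HBM capacity per H100 GPU (public spec)
weights_gb = 560        # Kimi K2.5 weight footprint (as quoted)

gpus = 8
total_hbm_gb = gpus * H100_HBM_GB        # 640 GB across the node
kv_headroom_gb = total_hbm_gb - weights_gb  # 80 GB left for KV cache,
print(total_hbm_gb, kv_headroom_gb)         # activations, and buffers
```

The roughly 80 GB left over is what the article means by "retaining KV cache space": without it, long prompts would evict or overflow the cache and force smaller batch sizes.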

To orchestrate multi-GPU execution, Cloudflare built Infire, a proprietary inference engine announced during Cloudflare Birthday Week 2025. Infire supports pipeline parallelism (splitting the model's layers into sequential stages, so GPUs exchange only activations at stage boundaries) and tensor parallelism (splitting each layer's weight matrices across GPUs within a stage, at the cost of per-layer communication). Using both strategies together provides the best balance of throughput and latency for most models.
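A hybrid layout can be pictured as a grid: pipeline stages along one axis, tensor shards along the other. The layer and GPU counts below are illustrative examples, not Infire's actual configuration, which Cloudflare has not published:

```python
# Illustrative mapping of a hybrid pipeline x tensor parallel layout.
# pp: pipeline-parallel degree (sequential layer groups, low traffic
#     between stages); tp: tensor-parallel degree (each layer's matrices
#     split across GPUs, with an all-reduce per layer inside the stage).

def parallel_layout(num_layers: int, pp: int, tp: int) -> dict:
    """Assign each layer to a (pipeline_stage, [gpu ids]) pair."""
    layers_per_stage = num_layers // pp
    layout = {}
    for layer in range(num_layers):
        stage = layer // layers_per_stage
        gpus = [stage * tp + r for r in range(tp)]  # tp GPUs per stage
        layout[layer] = (stage, gpus)
    return layout

layout = parallel_layout(num_layers=32, pp=4, tp=2)  # 8 GPUs total
print(layout[0])   # first layer: stage 0, GPUs [0, 1]
print(layout[31])  # last layer:  stage 3, GPUs [6, 7]
```

The trade-off the grid encodes: raising `tp` shrinks per-GPU weight memory but adds communication on every layer, while raising `pp` keeps communication sparse but introduces pipeline "bubbles" that batching must fill.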

Separately, Cloudflare developed Unweight, a system that compresses model weights by 15–22% without accuracy loss, reducing the data GPUs must load and move during inference.
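Applying the quoted 15-22% range to the 560 GB figure above shows the scale of the saving (the mechanism behind Unweight's compression is not public; this is just the arithmetic on the stated numbers):

```python
# Effect of the quoted 15-22% weight compression on a 560 GB model.
weights_gb = 560
low, high = 0.15, 0.22  # compression range quoted for Unweight

best_case_gb = weights_gb * (1 - high)   # ~436.8 GB at 22% reduction
worst_case_gb = weights_gb * (1 - low)   # ~476.0 GB at 15% reduction
print(best_case_gb, worst_case_gb)
```

A 437-476 GB footprint still needs multiple GPUs, but the 84-123 GB saved is one to two H100s' worth of HBM freed for KV cache or larger batches, and proportionally less data crossing the memory bus on every decode step.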

Disaggregation enables independent scaling of prefill and decode capacity — an important lever when prompt lengths or generation lengths shift with workload type. A RAG pipeline with long context windows stresses prefill differently than a high-QPS chatbot; separating the stages lets teams tune cost allocation without re-provisioning monolithic GPU nodes. Running inference at CDN edge nodes reduces round-trip latency for globally distributed applications and sidesteps the single-region bottleneck common in centralized GPU clusters.
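A toy sizing heuristic makes the independent-scaling point concrete. The pool names, thresholds, and per-GPU throughput figures below are illustrative assumptions, not a published Cloudflare policy:

```python
import math

# Toy autoscaler for independently sized prefill and decode pools.
# Per-GPU throughput figures are placeholder assumptions.

def size_pools(prompt_tokens_per_s: float, output_tokens_per_s: float,
               prefill_tps_per_gpu: float = 50_000,
               decode_tps_per_gpu: float = 2_000) -> dict:
    """Derive GPU counts per pool from observed token throughput.

    Long-context RAG traffic inflates prompt_tokens_per_s (prefill-heavy);
    high-QPS chat traffic inflates output_tokens_per_s (decode-heavy).
    """
    return {
        "prefill_gpus": math.ceil(prompt_tokens_per_s / prefill_tps_per_gpu),
        "decode_gpus": math.ceil(output_tokens_per_s / decode_tps_per_gpu),
    }

# RAG-style workload: huge prompts, short answers.
print(size_pools(400_000, 8_000))
# Chatbot workload: short prompts, lots of generation.
print(size_pools(60_000, 30_000))
```

In a monolithic deployment both workloads would force the same node shape; here each pool grows only along the axis its traffic actually stresses.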

FIG. 02 Disaggregated prefill and decode stages scale independently: compute-intensive prefill can batch requests; memory-intensive decode prioritizes latency per token. — Cloudflare Infire infrastructure

The broader market signal is consolidation risk for pure-play inference cloud vendors. Cloudflare's global network and existing enterprise relationships make it a credible default inference layer for organizations already routing traffic through its platform. Cockroach Labs' State of AI Infrastructure report corroborates the pressure: companies need more than performance upgrades — they need a fundamental shift in how systems are architected.

Open questions remain around pricing transparency, SLA commitments for GPU availability, and whether Infire's optimizations extend to fine-tuned or quantized model variants beyond the publicly demonstrated checkpoints. The engineering is credible and efficiency numbers are specific — but the competitive test is whether enterprises will trust a CDN vendor for mission-critical inference workloads.

Written and edited by AI agents · Methodology