INDUSTRY · BY AI EXPERT SCOUT · Monday, May 4, 2026 · 4 MIN READ
Cloudflare Runs Trillion-Parameter LLMs Across Global Edge Network
Cloudflare announced a new high-performance infrastructure specifically designed to run large language models, addressing latency and throughput constraints for enterprise inference. The move signals consolidation of AI infrastructure providers around CDN and edge networks as core deployment platforms.
FIG. 01 · Cloudflare's disaggregated inference: splitting compute and memory across the edge (generative imagery)
Cloudflare has deployed a purpose-built AI inference stack to run frontier-scale large language models across its global network, positioning its edge infrastructure as a production-grade alternative to hyperscaler GPU clouds. The core innovation is disaggregated prefill: splitting an LLM request's two processing stages across separate physical machines. Prefill (processing input tokens and populating the key-value cache) is compute-bound and runs on one hardware class; decode (generating output tokens) is memory-bound and runs on another.
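To make the split concrete, here is a minimal runnable sketch using a toy model built from random projections. The point is the handoff between stages, not the math; none of the names reflect Cloudflare's actual API.

```python
import numpy as np

rng = np.random.default_rng(0)
D, V = 64, 1000                          # toy hidden size and vocab size
W_kv = rng.standard_normal((D, 2 * D))   # stand-in for attention K/V weights
W_out = rng.standard_normal((D, V))      # stand-in for the LM head

def prefill(prompt_embeddings):
    """Compute-bound stage: process every prompt position in one batched
    matmul, producing the KV cache and logits for the first new token."""
    kv_cache = prompt_embeddings @ W_kv        # (seq_len, 2D)
    logits = prompt_embeddings[-1] @ W_out     # (V,)
    return kv_cache, int(np.argmax(logits))

def decode_step(token_embedding, kv_cache):
    """Memory-bound stage: one token per step, re-reading and growing
    the KV cache on every iteration."""
    kv_cache = np.vstack([kv_cache, token_embedding @ W_kv])
    logits = token_embedding @ W_out
    return kv_cache, int(np.argmax(logits))

# Disaggregation places prefill() on a compute-optimized node, ships the
# KV cache across the network, and loops decode_step() on a
# memory-bandwidth-optimized node.
prompt = rng.standard_normal((16, D))    # 16 "prompt tokens"
cache, token = prefill(prompt)
for _ in range(4):
    cache, token = decode_step(rng.standard_normal(D), cache)
```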
Matching workload characteristics to hardware reduces wasted capacity and improves both latency and throughput per dollar. Models like Kimi K2.5 exceed one trillion parameters, with weights occupying roughly 560 GB, so eight H100 GPUs are needed just to load them. Cloudflare's disaggregation lets large models run on fewer or less expensive GPUs: Llama 4 Scout runs on two H200 GPUs with substantial headroom to spare, and Kimi K2.5 runs on eight H100s while still leaving room for the KV cache.
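The arithmetic behind those claims is easy to check. The sketch below recomputes the headroom from published GPU memory capacities; the 560 GB figure comes from the article, while the Llama 4 Scout weight size is our assumption (roughly 109B parameters stored at 8 bits each), since the article gives no number.

```python
H100_GB, H200_GB = 80, 141   # published per-GPU memory capacities

def kv_headroom_gb(weights_gb, gpu_gb, n_gpus):
    """Aggregate GPU memory left for KV cache and activations
    after the model weights are loaded."""
    return gpu_gb * n_gpus - weights_gb

# Kimi K2.5 on eight H100s: 8 x 80 = 640 GB total, ~80 GB of headroom.
print(kv_headroom_gb(560, H100_GB, 8))   # -> 80

# Llama 4 Scout on two H200s, assuming ~109 GB of weights (see above):
# roughly 173 GB of headroom remains.
print(kv_headroom_gb(109, H200_GB, 2))   # -> 173
```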
To orchestrate multi-GPU execution, Cloudflare built Infire, a proprietary inference engine announced during Cloudflare Birthday Week 2025. Infire supports pipeline parallelism (partitioning the model's layers into sequential stages, each served by its own GPU group) and tensor parallelism (splitting individual weight matrices across the GPUs within a stage). Using both strategies together provides the best balance of throughput and latency for most models.
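As an illustration of how the two strategies compose, the sketch below maps layers to pipeline stages and GPUs to tensor-parallel groups. This is generic parallelism bookkeeping under assumed shapes, not Infire's internals.

```python
def plan(num_layers: int, pp: int, tp: int):
    """Assign each layer to a pipeline stage and the tensor-parallel
    group of GPUs that holds its weight shards."""
    per_stage = num_layers // pp
    placement = {}
    for layer in range(num_layers):
        stage = min(layer // per_stage, pp - 1)
        gpus = [stage * tp + r for r in range(tp)]  # GPUs sharding this layer
        placement[layer] = {"stage": stage, "tp_gpus": gpus}
    return placement

# 80 layers on 8 GPUs as 4 pipeline stages x 2-way tensor parallelism:
# tokens flow stage 0 -> 1 -> 2 -> 3; within a stage, the two GPUs
# all-reduce partial results after each sharded matmul.
for layer, loc in list(plan(80, pp=4, tp=2).items())[:3]:
    print(layer, loc)
```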
Separately, Cloudflare developed Unweight, a system that compresses model weights by 15–22% without accuracy loss, reducing the data GPUs must load and move during inference.
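The article does not describe how Unweight achieves its 15–22% reduction. For context only, a generic lossless baseline looks like the sketch below: byte-level compression of float16 tensors. On random data its ratio is far below Unweight's reported numbers; trained weights compress better because their sign and exponent bytes are low-entropy.

```python
import zlib
import numpy as np

rng = np.random.default_rng(0)
weights = rng.standard_normal(1_000_000).astype(np.float16)
raw = weights.tobytes()
packed = zlib.compress(raw, 9)
print(f"raw {len(raw) / 1e6:.1f} MB -> compressed {len(packed) / 1e6:.1f} MB "
      f"({100 * (1 - len(packed) / len(raw)):.0f}% saved)")
# Random data compresses poorly; real weight tensors have structured,
# low-entropy exponent bytes, which is where lossless schemes find savings.
```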
Disaggregation enables independent scaling of prefill and decode capacity — an important lever when prompt lengths or generation lengths shift with workload type. A RAG pipeline with long context windows stresses prefill differently than a high-QPS chatbot; separating the stages lets teams tune cost allocation without re-provisioning monolithic GPU nodes. Running inference at CDN edge nodes reduces round-trip latency for globally distributed applications and sidesteps the single-region bottleneck common in centralized GPU clusters.
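A toy capacity model makes that lever concrete. The throughput figures below are hypothetical placeholders, not Cloudflare measurements; the point is that the two pools scale in opposite directions as the traffic mix shifts.

```python
import math

def replicas_needed(qps, prompt_tokens, output_tokens,
                    prefill_tok_per_s, decode_tok_per_s):
    """Prefill demand scales with prompt length; decode demand scales
    with generation length. Throughputs are per replica."""
    prefill = math.ceil(qps * prompt_tokens / prefill_tok_per_s)
    decode = math.ceil(qps * output_tokens / decode_tok_per_s)
    return prefill, decode

# RAG-style traffic: long prompts, short answers -> prefill-heavy.
print(replicas_needed(50, 8000, 300, 100_000, 10_000))   # (4, 2)
# Chatbot traffic: short prompts, long answers -> decode-heavy.
print(replicas_needed(50, 500, 1500, 100_000, 10_000))   # (1, 8)
```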
FIG. 02 · Disaggregated prefill and decode stages scale independently: compute-intensive prefill can batch requests; memory-intensive decode prioritizes latency per token. (Cloudflare Infire infrastructure)
The broader market signal is consolidation risk for pure-play inference cloud vendors. Cloudflare's global network and existing enterprise relationships make it a credible default inference layer for organizations already routing traffic through its platform. Cockroach Labs' State of AI Infrastructure report underscores the pressure: companies need more than performance upgrades; they need a fundamental shift in how systems are architected.
Open questions remain around pricing transparency, SLA commitments for GPU availability, and whether Infire's optimizations extend to fine-tuned or quantized model variants beyond the publicly demonstrated checkpoints. The engineering is credible and the efficiency numbers are specific, but the competitive test is whether enterprises will trust a CDN vendor with mission-critical inference workloads.