INDUSTRY · BY AI EXPERT SCOUT · Monday, May 4, 2026 · 4 MIN READ
Cloudflare Runs Trillion-Parameter LLMs Across Global Edge Network
Cloudflare announced a new high-performance infrastructure specifically designed to run large language models, addressing latency and throughput constraints for enterprise inference. The move signals consolidation of AI infrastructure providers around CDN and edge networks as core deployment platforms.
FIG. 01 · Cloudflare's disaggregated inference: splitting compute and memory across the edge (generative imagery)
Cloudflare has deployed a purpose-built AI inference stack to run frontier-scale large language models across its global network, positioning its edge infrastructure as a production-grade alternative to hyperscaler GPU clouds. The core innovation is disaggregated prefill: splitting an LLM request's two processing stages across separate physical machines. Prefill (processing input tokens and populating the key-value cache) is compute-bound and runs on one hardware class; decode (generating output tokens) is memory-bound and runs on another.
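To make the split concrete, here is a minimal runnable sketch using a toy model built from random projections. The point is the handoff between stages, not the math; none of the names reflect Cloudflare's actual API.

```python
import numpy as np

rng = np.random.default_rng(0)
D, V = 64, 1000                          # toy hidden size and vocab size
W_kv = rng.standard_normal((D, 2 * D))   # stand-in for attention K/V weights
W_out = rng.standard_normal((D, V))      # stand-in for the LM head

def prefill(prompt_embeddings):
    """Compute-bound stage: process every prompt position in one batched
    matmul, producing the KV cache and logits for the first new token."""
    kv_cache = prompt_embeddings @ W_kv        # (seq_len, 2D)
    logits = prompt_embeddings[-1] @ W_out     # (V,)
    return kv_cache, int(np.argmax(logits))

def decode_step(token_embedding, kv_cache):
    """Memory-bound stage: one token per step, re-reading and growing
    the KV cache on every iteration."""
    kv_cache = np.vstack([kv_cache, token_embedding @ W_kv])
    logits = token_embedding @ W_out
    return kv_cache, int(np.argmax(logits))

# Disaggregation places prefill() on a compute-optimized node, ships the
# KV cache across the network, and loops decode_step() on a
# memory-bandwidth-optimized node.
prompt = rng.standard_normal((16, D))    # 16 "prompt tokens"
cache, token = prefill(prompt)
for _ in range(4):
    cache, token = decode_step(rng.standard_normal(D), cache)
```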
Matching workload characteristics to hardware reduces wasted capacity and improves both latency and throughput per dollar. Models like Kimi K2.5 exceed one trillion parameters, with weights occupying roughly 560 GB, so eight H100 GPUs are needed just to load them. Cloudflare's disaggregation lets large models run on fewer or less expensive GPUs: Llama 4 Scout runs on two H200 GPUs with substantial headroom to spare, and Kimi K2.5 runs on eight H100s while still leaving room for the KV cache.
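The arithmetic behind those claims is easy to check. The sketch below recomputes the headroom from published GPU memory capacities; the 560 GB figure comes from the article, while the Llama 4 Scout weight size is our assumption (roughly 109B parameters stored at 8 bits each), since the article gives no number.

```python
H100_GB, H200_GB = 80, 141   # published per-GPU memory capacities

def kv_headroom_gb(weights_gb, gpu_gb, n_gpus):
    """Aggregate GPU memory left for KV cache and activations
    after the model weights are loaded."""
    return gpu_gb * n_gpus - weights_gb

# Kimi K2.5 on eight H100s: 8 x 80 = 640 GB total, ~80 GB of headroom.
print(kv_headroom_gb(560, H100_GB, 8))   # -> 80

# Llama 4 Scout on two H200s, assuming ~109 GB of weights (see above):
# roughly 173 GB of headroom remains.
print(kv_headroom_gb(109, H200_GB, 2))   # -> 173
```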
To orchestrate multi-GPU execution, Cloudflare built Infire, a proprietary inference engine announced during Cloudflare Birthday Week 2025. Infire supports pipeline parallelism (partitioning the model's layers into sequential stages, each served by its own GPU group) and tensor parallelism (splitting individual weight matrices across the GPUs within a stage). Using both strategies together provides the best balance of throughput and latency for most models.
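As an illustration of how the two strategies compose, the sketch below maps layers to pipeline stages and GPUs to tensor-parallel groups. This is generic parallelism bookkeeping under assumed shapes, not Infire's internals.

```python
def plan(num_layers: int, pp: int, tp: int):
    """Assign each layer to a pipeline stage and the tensor-parallel
    group of GPUs that holds its weight shards."""
    per_stage = num_layers // pp
    placement = {}
    for layer in range(num_layers):
        stage = min(layer // per_stage, pp - 1)
        gpus = [stage * tp + r for r in range(tp)]  # GPUs sharding this layer
        placement[layer] = {"stage": stage, "tp_gpus": gpus}
    return placement

# 80 layers on 8 GPUs as 4 pipeline stages x 2-way tensor parallelism:
# tokens flow stage 0 -> 1 -> 2 -> 3; within a stage, the two GPUs
# all-reduce partial results after each sharded matmul.
for layer, loc in list(plan(80, pp=4, tp=2).items())[:3]:
    print(layer, loc)
```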
Separately, Cloudflare developed Unweight, a system that compresses model weights by 15–22% without accuracy loss, reducing the data GPUs must load and move during inference.
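The article does not describe how Unweight achieves its 15–22% reduction. For context only, a generic lossless baseline looks like the sketch below: byte-level compression of float16 tensors. On random data its ratio is far below Unweight's reported numbers; trained weights compress better because their sign and exponent bytes are low-entropy.

```python
import zlib
import numpy as np

rng = np.random.default_rng(0)
weights = rng.standard_normal(1_000_000).astype(np.float16)
raw = weights.tobytes()
packed = zlib.compress(raw, 9)
print(f"raw {len(raw) / 1e6:.1f} MB -> compressed {len(packed) / 1e6:.1f} MB "
      f"({100 * (1 - len(packed) / len(raw)):.0f}% saved)")
# Random data compresses poorly; real weight tensors have structured,
# low-entropy exponent bytes, which is where lossless schemes find savings.
```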
Disaggregation enables independent scaling of prefill and decode capacity — an important lever when prompt lengths or generation lengths shift with workload type. A RAG pipeline with long context windows stresses prefill differently than a high-QPS chatbot; separating the stages lets teams tune cost allocation without re-provisioning monolithic GPU nodes. Running inference at CDN edge nodes reduces round-trip latency for globally distributed applications and sidesteps the single-region bottleneck common in centralized GPU clusters.
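A toy capacity model makes that lever concrete. The throughput figures below are hypothetical placeholders, not Cloudflare measurements; the point is that the two pools scale in opposite directions as the traffic mix shifts.

```python
import math

def replicas_needed(qps, prompt_tokens, output_tokens,
                    prefill_tok_per_s, decode_tok_per_s):
    """Prefill demand scales with prompt length; decode demand scales
    with generation length. Throughputs are per replica."""
    prefill = math.ceil(qps * prompt_tokens / prefill_tok_per_s)
    decode = math.ceil(qps * output_tokens / decode_tok_per_s)
    return prefill, decode

# RAG-style traffic: long prompts, short answers -> prefill-heavy.
print(replicas_needed(50, 8000, 300, 100_000, 10_000))   # (4, 2)
# Chatbot traffic: short prompts, long answers -> decode-heavy.
print(replicas_needed(50, 500, 1500, 100_000, 10_000))   # (1, 8)
```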
FIG. 02 · Disaggregated prefill and decode stages scale independently: compute-intensive prefill can batch requests; memory-intensive decode prioritizes latency per token. (Cloudflare Infire infrastructure)
The broader market signal is consolidation risk for pure-play inference cloud vendors. Cloudflare's global network and existing enterprise relationships make it a credible default inference layer for organizations already routing traffic through its platform. Cockroach Labs' State of AI Infrastructure report underscores the pressure: companies need more than performance upgrades; they need a fundamental shift in how systems are architected.
Open questions remain around pricing transparency, SLA commitments for GPU availability, and whether Infire's optimizations extend to fine-tuned or quantized model variants beyond the publicly demonstrated checkpoints. The engineering is credible and the efficiency numbers are specific, but the competitive test is whether enterprises will trust a CDN vendor with mission-critical inference workloads.