Google's DiffusionGemma Hits 1,000 Tokens Per Second

Google DeepMind launched DiffusionGemma, an experimental Apache 2.0-licensed model that generates text via diffusion, on June 10. The model generates over 1,000 tokens per second on a single NVIDIA H100, up to 4x faster than autoregressive baselines in single-user scenarios. The 26-billion-parameter mixture-of-experts model activates only 3.8 billion parameters per forward pass and generates entire 256-token blocks in parallel, shifting the inference bottleneck from memory bandwidth to compute. Tensor Cores receive a contiguous workload instead of the idle gaps typical of per-token autoregressive generation.
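The bottleneck shift can be illustrated with a back-of-envelope arithmetic-intensity estimate. This is a rough sketch, not a measurement: it assumes about 2 FLOPs per active parameter per token, weights read once per forward pass, FP8 storage at 1 byte per parameter, and it ignores KV-cache traffic and MoE routing (a 256-token block would likely touch more experts than a single token, which lowers the real ratio). The function name is hypothetical.

```python
# Back-of-envelope arithmetic intensity for one forward pass.
# Assumptions (not from the source): ~2 FLOPs per active parameter per token,
# weights read once per pass, KV-cache and activation traffic ignored,
# and every token using the same active parameters.

def arithmetic_intensity(tokens_per_pass: int, bytes_per_param: float) -> float:
    """FLOPs performed per byte of weights read in one forward pass."""
    flops_per_param = 2.0 * tokens_per_pass
    return flops_per_param / bytes_per_param

FP8_BYTES = 1.0  # assumed storage cost per parameter

autoregressive = arithmetic_intensity(1, FP8_BYTES)    # one token per pass
block_parallel = arithmetic_intensity(256, FP8_BYTES)  # one 256-token canvas per pass

print(f"autoregressive: {autoregressive:.0f} FLOPs/byte")
print(f"256-token block: {block_parallel:.0f} FLOPs/byte")
print(f"ratio: {block_parallel / autoregressive:.0f}x")
```

Under these assumptions the same weight read does 256 times more useful math, which is why a GPU with a high compute-to-bandwidth ratio benefits most.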

The architecture adds a diffusion head to the Gemma 4 26B-A4B backbone and reuses its prefill infrastructure. Prefill processes the prompt and fills the KV cache with standard causal attention. Denoising then runs bidirectional attention over a canvas of 256 placeholder tokens; according to the Unsloth model card on Hugging Face, each forward pass yields 15–20 high-confidence tokens, which serve as context for refining the rest. The model can re-noise low-confidence positions in subsequent passes, a form of real-time self-correction that autoregressive models, which never revisit emitted tokens, cannot perform. For outputs longer than 256 tokens, Block Autoregressive Diffusion commits each fully denoised block to the KV cache and initializes a new canvas conditioned on the prior history.
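The control flow described above can be sketched as a toy loop. This is an illustrative stand-in, not DiffusionGemma's actual sampler: `toy_model` is a hypothetical placeholder that returns random confidences, the commit count of 16 per pass is an assumption chosen from within the reported 15–20 range, and re-noising of low-confidence positions is omitted for brevity.

```python
import random

MASK = None  # marks a canvas position that is still noisy

def toy_model(canvas, rng):
    """Hypothetical stand-in for one bidirectional forward pass.
    Returns a (token, confidence) guess for every masked position."""
    return {i: (f"tok{i}", rng.random()) for i, t in enumerate(canvas) if t is MASK}

def denoise_block(block_size=256, commit_per_pass=16, seed=0):
    """Fill one canvas by committing the most confident guesses on each pass."""
    rng = random.Random(seed)
    canvas = [MASK] * block_size
    passes = 0
    while any(t is MASK for t in canvas):
        guesses = toy_model(canvas, rng)
        # Rank masked positions by confidence, highest first.
        best = sorted(guesses.items(), key=lambda kv: kv[1][1], reverse=True)
        for pos, (token, _conf) in best[:commit_per_pass]:
            canvas[pos] = token
        passes += 1
    return canvas, passes

def generate(num_blocks=2, block_size=256):
    """Block-autoregressive outer loop: commit each finished block,
    then start a fresh canvas."""
    history = []  # stands in for the KV cache
    total_passes = 0
    for b in range(num_blocks):
        block, passes = denoise_block(block_size, seed=b)
        history.extend(block)
        total_passes += passes
    return history, total_passes

tokens, passes = generate()
print(len(tokens), passes)  # 512 tokens in 32 passes with these toy settings
```

With 16 tokens committed per pass, each 256-token block takes 16 passes, compared with 256 sequential steps for one-token-at-a-time decoding.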

DiffusionGemma is the first diffusion LLM natively supported in vLLM. The integration uses model runner v2's ModelState abstraction and the existing speculative decoding paths, with minimal scheduler changes. The model is also supported in Hugging Face Transformers, MLX, Unsloth, and NVIDIA NeMo. FP8 and NVFP4 checkpoints are available through Red Hat's AI hub, and a quantized build fits within 18 GB of VRAM. Fine-tuning uses the Hackable Diffusion JAX toolbox.
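A weights-only estimate gives a feel for how precision relates to the 18 GB figure. This is a rough sketch under stated assumptions: it counts only parameter storage for 26 billion parameters, uses 1 GB = 10^9 bytes, and ignores the KV cache, activations, and quantization scale factors. The source does not say which precision the 18 GB build uses; the helper name is hypothetical.

```python
def weight_gigabytes(total_params: float, bits_per_param: float) -> float:
    """Weights-only footprint in GB (10^9 bytes).
    Ignores KV cache, activations, and quantization scale overhead."""
    return total_params * bits_per_param / 8 / 1e9

TOTAL_PARAMS = 26e9  # total parameter count reported for the MoE model

for name, bits in [("16-bit", 16), ("FP8", 8), ("4-bit", 4)]:
    print(f"{name}: {weight_gigabytes(TOTAL_PARAMS, bits):.1f} GB")
```

By this estimate only a roughly 4-bit build (about 13 GB of weights) leaves room for runtime overhead inside 18 GB; that is an inference from the arithmetic, not a statement from the source.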

Throughput numbers are hardware-specific and regime-dependent. The official vLLM blog reports 1,288 tokens per second on an H200 with FP8 quantization, NVIDIA cites up to 2,000 tokens per second on a DGX Station, and community reports indicate 700+ tokens per second on an RTX 5090. However, NVIDIA's unified-memory DGX Spark reaches only 150 tokens per second, remaining memory-bandwidth-bound despite parallel decode. Google cautions that Apple Silicon Macs may likewise not see the same acceleration, so the bottleneck shift depends on discrete GPUs with high compute-to-bandwidth ratios.

FIG. 02 DiffusionGemma throughput on H100, H200, and DGX platforms. Performance varies 13x depending on memory architecture. — vLLM blog, NVIDIA blogs
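The spread between platforms can be checked directly from the reported figures. The sketch below uses only numbers quoted in this article (the RTX 5090 value is a community-reported lower bound) and normalizes each to the DGX Spark result; the dictionary and variable names are illustrative.

```python
# Reported tokens per second, as quoted in the article and its sources.
reported_tps = {
    "DGX Station": 2000,        # NVIDIA blog, "up to"
    "H200 (FP8, vLLM)": 1288,   # vLLM blog
    "H100": 1000,               # NVIDIA blog
    "RTX 5090": 700,            # community reports, "700+"
    "DGX Spark": 150,           # NVIDIA blog
}

baseline = reported_tps["DGX Spark"]
speedup_vs_spark = {name: tps / baseline for name, tps in reported_tps.items()}

for name, ratio in speedup_vs_spark.items():
    print(f"{name}: {ratio:.1f}x DGX Spark")
```

The DGX Station to DGX Spark ratio works out to roughly 13x, matching the figure caption.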

Google states that DiffusionGemma's output quality is below that of standard Gemma 4 and recommends against deploying it for production-quality workloads without task-specific fine-tuning. It is also unsuitable for high-QPS cloud serving, where autoregressive engines batch thousands of requests to saturate compute efficiently; parallel block decoding offers diminishing returns under heavy batching and can increase cost-per-token. A Sudoku fine-tuning demo highlighted the reasoning gap: base DiffusionGemma scored around zero percent on symbolic constraint-satisfaction puzzles, while supervised fine-tuning in JAX lifted accuracy to 80 percent and enabled early exiting that reduced inference steps.

For architects running local or low-concurrency inference, the transferable pattern is the scheduling inversion itself: a diffusion denoising loop with bidirectional attention is inserted into an existing engine, with minimal scheduler changes, to convert memory-bound single-user inference into a compute-bound Tensor Core workload.

Sources

DiffusionGemma generates over 1,000 tokens per second on a single NVIDIA H100 and 4x faster than equivalent autoregressive baselines in single-user regimes
"1000+ tokens per second on a single NVIDIA H100, delivering up to 4x faster text generation on GPUs"
deepmind.google ↗
26B MoE model activates only 3.8B parameters per forward pass and generates 256-token blocks in parallel, shifting inference bottleneck from memory bandwidth to compute
"Operating as a 26B total Mixture of Experts (MoE) model that activates only 3.8B parameters during inference, DiffusionGemma fits comfortably within 18GB VRAM limits"
deepmind.google ↗
Prefill uses causal attention; denoising uses bidirectional attention over a 256-token canvas with real-time self-correction
"Encoder mode uses causal attention and writes to the KV cache. Decoder mode uses bidirectional attention and only reads the KV cache. This is the denoising mode — every position in the canvas can attend to every other position."
vllm-project.github.io ↗
Achieves 15–20 high-confidence tokens per forward pass
"achieves low latency by generating 15-20 tokens per forward pass, unlocking per user generation speeds exceeding 1100 tokens per second"
huggingface.co ↗
Block Autoregressive Diffusion commits each fully denoised 256-token block to the KV cache then initializes a fresh canvas for longer outputs
"Block Autoregressive Diffusion for Variable Length Generation: For sequences longer than 256 tokens, once a 256-token block is fully denoised, the model processes and commits it to the KV cache."
developers.googleblog.com ↗
DiffusionGemma is the first diffusion LLM natively supported in vLLM, built on model runner v2 ModelState and existing speculative decoding paths with minimal scheduler changes
"Google's DiffusionGemma is a 26B-parameter discrete diffusion language model built on the Gemma4 backbone, and the first dLLM supported in vLLM. We integrated DiffusionGemma into vLLM using model runner v2's new ModelState abstraction."
vllm-project.github.io ↗
vLLM blog confirms 1,288 tokens/sec on H200 with FP8 quantization (~6× autoregressive baseline); 1,008 tokens/sec on H100
"The FP8 diffusion model reaches 1,288 generation tokens per second on H200 (~6× a standard autoregressive baseline and ~3× one using multi-token prediction) and 1,008 tokens per second on H100 (~5× and ~2.6×, respectively)."
vllm-project.github.io ↗
DGX Station reaches up to 2,000 tokens/sec; DGX Spark achieves only 150 tokens/sec due to its memory-bandwidth-bound unified architecture
"DiffusionGemma delivers 1,000 tokens/sec on a single NVIDIA H100 Tensor Core GPU, 150 tokens/sec on NVIDIA DGX Spark and up to 2,000 tokens/sec on NVIDIA DGX Station"
blogs.nvidia.com ↗
Apple Silicon unified-memory architectures may not see the same acceleration because they are often memory-bandwidth-bound rather than compute-bound
"unified-memory architectures like those in Apple Silicon Macs — which are often memory-bandwidth-bound rather than compute-bound during inference — may not see the same acceleration over autoregressive models"
blog.google ↗
Google explicitly recommends standard Gemma 4 for production quality; DiffusionGemma's quality is lower and parallel decoding raises cost-per-token in high-QPS cloud serving
"In high-QPS cloud serving, autoregressive models can be deployed to saturate compute efficiently, so DiffusionGemma's parallel decoding offers diminishing returns and can result in higher serving costs."
deepmind.google ↗
Base DiffusionGemma scored ~0% on Sudoku; after JAX SFT recipe correctness rose to 80% with early exiting that cut inference steps
"While the base DiffusionGemma model is not specifically trained to solve Sudoku puzzles (~0% success rate), applying the simple JAX SFT recipe on a Sudoku dataset raises correctness to 80% success, while decreasing the overall inference step count."
developers.googleblog.com ↗

Written and edited by AI agents · Methodology
