Google DeepMind launched DiffusionGemma on June 10: an experimental, Apache 2.0-licensed model that generates text by diffusion. It exceeds 1,000 tokens per second on a single NVIDIA H100, a 4x latency improvement over autoregressive baselines in single-user scenarios. The 26-billion-parameter mixture-of-experts model activates only 3.8 billion parameters per forward pass and generates entire 256-token blocks in parallel, which shifts the inference bottleneck from memory bandwidth to compute. Tensor Cores receive a contiguous workload instead of the idle gaps typical of per-token autoregressive generation.
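The bottleneck shift can be illustrated with a rough arithmetic-intensity calculation. The sketch below is a simplified roofline model, not a measurement. It counts only weight traffic and matmul FLOPs, ignores attention and KV cache traffic, and assumes FP8 weights at one byte per parameter. The `Pass` class, the best-case and worst-case expert routing scenarios, and the accelerator figures (1e15 FLOP/s, 3e12 B/s) are all hypothetical, chosen for illustration. Only the 3.8 billion, 26 billion, and 256 figures come from the article.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pass:
    """One forward pass, reduced to weight traffic and matmul FLOPs."""
    params_read: float      # parameters streamed from memory in this pass
    active_params: float    # parameters each position is multiplied against
    positions: int          # token positions computed in parallel
    bytes_per_param: float  # assumed 1.0 for FP8 weights

    @property
    def flops(self) -> float:
        # Roughly 2 FLOPs (multiply and add) per active parameter per position.
        return 2.0 * self.active_params * self.positions

    @property
    def intensity(self) -> float:
        """Arithmetic intensity in FLOPs per byte of weight traffic."""
        return self.flops / (self.params_read * self.bytes_per_param)


def is_compute_bound(p: Pass, peak_flops: float, bandwidth: float) -> bool:
    """Roofline test: compute-bound when intensity exceeds the machine balance."""
    return p.intensity > peak_flops / bandwidth


ACTIVE, TOTAL = 3.8e9, 26e9  # parameter counts quoted in the article

# Per-token decode: one position per pass, only the active parameters are read.
ar = Pass(params_read=ACTIVE, active_params=ACTIVE, positions=1, bytes_per_param=1.0)
# Block denoising, best case: all 256 positions share the same active parameters.
block_best = Pass(ACTIVE, ACTIVE, 256, 1.0)
# Block denoising, worst case: the 256 positions together touch every expert.
block_worst = Pass(TOTAL, ACTIVE, 256, 1.0)

# Hypothetical accelerator: 1e15 FLOP/s peak, 3e12 B/s memory bandwidth.
for name, p in [("per-token", ar), ("block best", block_best), ("block worst", block_worst)]:
    print(name, round(p.intensity, 1), is_compute_bound(p, 1e15, 3e12))
```

In this model, per-token decoding performs about 2 FLOPs per byte of weights, while a 256-position pass performs between roughly 75 and 512. Whether that is enough to become compute-bound depends on the hardware's compute-to-bandwidth ratio.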
The architecture adds a diffusion head to the Gemma 4 26B-A4B backbone and reuses its prefill infrastructure. Prefill processes the prompt and fills the KV cache with standard causal attention. Denoising then runs bidirectional attention over a canvas of 256 placeholder tokens. According to the Unsloth Hugging Face model card, each forward pass commits 15–20 high-confidence tokens, which serve as context for refining the rest. The model can also re-noise low-confidence positions in later passes, a form of self-correction unavailable to autoregressive models, which cannot revise tokens they have already emitted. For outputs longer than 256 tokens, Block Autoregressive Diffusion commits each fully denoised block to the KV cache and initializes a new canvas conditioned on the prior history.
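The loop described above can be sketched as a toy in Python. This is not DiffusionGemma's implementation. `toy_denoiser` is a hypothetical stand-in that returns random tokens and confidences and ignores its inputs, where a real model would attend over them. The parameters `commit_per_pass`, `renoise_below`, and `renoise_budget` are illustrative assumptions. Only the 256-position canvas, the commit-by-confidence step, the re-noising step, and the block-by-block outer loop follow the article.

```python
import random

MASK = None   # placeholder for a canvas position that is not yet committed
BLOCK = 256   # canvas size quoted in the article


def toy_denoiser(history, canvas, rng):
    """Hypothetical model stand-in: one (token, confidence) pair per position."""
    return [(rng.randrange(1000), rng.random()) for _ in canvas]


def denoise_block(history, rng, commit_per_pass=16,
                  renoise_below=0.02, renoise_budget=24):
    """Fill one canvas by committing the most confident guesses on each pass."""
    canvas = [MASK] * BLOCK
    passes = 0
    while MASK in canvas:
        passes += 1
        guesses = toy_denoiser(history, canvas, rng)
        doubted = set()
        # Re-noise: committed positions the model now doubts are reopened.
        # The budget guarantees termination in this toy.
        if passes <= renoise_budget:
            for i, (_, conf) in enumerate(guesses):
                if canvas[i] is not MASK and conf < renoise_below:
                    canvas[i] = MASK
                    doubted.add(i)
        # Commit: highest-confidence guesses among positions open before this pass.
        open_pos = [i for i, tok in enumerate(canvas)
                    if tok is MASK and i not in doubted]
        open_pos.sort(key=lambda i: guesses[i][1], reverse=True)
        for i in open_pos[:commit_per_pass]:
            canvas[i] = guesses[i][0]
    return canvas, passes


def generate(num_tokens, seed=0):
    """Block-autoregressive outer loop: finish a block, commit it, start a new canvas."""
    rng = random.Random(seed)
    history = []        # stands in for the KV cache of committed blocks
    total_passes = 0
    while len(history) < num_tokens:
        block, passes = denoise_block(history, rng)
        history.extend(block)
        total_passes += passes
    return history[:num_tokens], total_passes


tokens, passes = generate(600)
print(len(tokens), passes)
```

With 16 commits per pass, a 256-position canvas needs at least 16 passes, so each forward pass yields many tokens where per-token decoding yields one.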
DiffusionGemma is the first diffusion LLM with native support in vLLM. The integration runs through the model runner v2 ModelState and the existing speculative decoding paths, and requires only minimal scheduler changes. The model is also supported in Hugging Face Transformers, MLX, Unsloth, and NVIDIA NeMo. FP8 and NVFP4 checkpoints are available through Red Hat's AI hub, and a quantized build fits within 18 GB of VRAM. Fine-tuning uses the Hackable Diffusion JAX toolbox.
Throughput depends on both hardware and serving regime. The official vLLM blog reports 1,288 tokens per second on an H200 with FP8 quantization, while community reports indicate 700+ tokens per second on an RTX 5090 and up to 2,000 tokens per second on a DGX Station. Unified-memory systems such as NVIDIA's DGX Spark, however, reach only 150 tokens per second and remain memory-bandwidth-bound despite parallel decoding. Apple Silicon also sees muted benefits, so the bottleneck shift holds only on discrete GPUs with high compute-to-bandwidth ratios.
Google states that DiffusionGemma's output quality is below that of standard Gemma 4 and recommends against deploying it for production-quality workloads without task-specific fine-tuning. It is also unsuitable for high-QPS cloud serving, where autoregressive engines batch thousands of requests to saturate compute efficiently; parallel block decoding offers diminishing returns under heavy batching and can increase cost-per-token. A Sudoku fine-tuning demo highlighted the reasoning gap: base DiffusionGemma scored around zero percent on symbolic constraint-satisfaction puzzles, while supervised fine-tuning in JAX lifted accuracy to 80 percent and enabled early exiting that reduced inference steps.
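The batching argument can be made concrete with an idealized cost model. The sketch below is an illustration under stated assumptions, not a benchmark. It assumes that weights are read once per forward pass regardless of batch size, that a diffusion pass computes 256 positions per request but commits only 16 of them, and that pass time is the slower of weight loading and matmul compute. The accelerator figures (1e15 FLOP/s, 3e12 B/s) are hypothetical round numbers, and the function names are invented for this example. The absolute outputs are not predictions of real throughput.

```python
def tokens_per_second(batch, positions, committed, *, active_params=3.8e9,
                      bytes_per_param=1.0, peak_flops=1e15, bandwidth=3e12):
    """Idealized committed tokens per second for `batch` concurrent requests.

    peak_flops and bandwidth are hypothetical values, not a real GPU's specs.
    """
    # Weights are streamed once per pass, independent of batch size.
    weight_time = active_params * bytes_per_param / bandwidth
    # Compute grows with every position of every request in the batch.
    compute_time = 2.0 * active_params * positions * batch / peak_flops
    return batch * committed / max(weight_time, compute_time)


def diffusion_speedup(batch, block=256, committed=16):
    """Block-diffusion throughput relative to per-token decoding at the same batch."""
    ar = tokens_per_second(batch, positions=1, committed=1)
    diff = tokens_per_second(batch, positions=block, committed=committed)
    return diff / ar


for batch in (1, 8, 64, 1024):
    print(batch, round(diffusion_speedup(batch), 2))
```

In this model the ratio is above 1 at batch size 1, where per-token decoding is limited by weight loading. It falls to 16/256 once both engines are compute-bound, because each diffusion pass spends compute on 256 positions to commit 16 tokens. That is the mechanism behind the higher cost-per-token under heavy batching.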
For architects running local or low-concurrency inference, the transferable pattern is the scheduling inversion itself: inserting a diffusion denoising loop with bidirectional attention into an existing engine converts memory-bound, single-user inference into a compute-bound Tensor Core workload, and it does so with minimal scheduler changes.
Written and edited by AI agents