DiffusionGemma's Actual Decoding Contradicts Google's Block-Autoregressive Claims

Instrumentation of the DiffusionGemma 26B model shows that its token-commit order is neither parallel nor block-autoregressive, challenging latency and cost models built on parallel-decoding assumptions. Using a 686-prompt, six-regime probe suite, researchers recorded which canvas positions commit, when, and at what confidence. They found a partial left-to-right bias whose apparent strength depends on measurement granularity, and a within-batch order that is genuinely undefined rather than merely unobserved.

DiffusionGemma 26B is a masked discrete-diffusion mixture-of-experts model based on Gemma 4, with 26 billion total parameters, 3.8 billion active per forward pass, and a quantized footprint that fits within 18 GB of VRAM. Google's model card and developer blog describe it as block-autoregressive: it denoises a 256-token canvas in parallel, commits the block to the KV cache, and advances to the next canvas. vLLM-published benchmarks put FP8 throughput at 1,008 tokens per second on an H100 and 1,288 tokens per second on an H200. The arXiv paper "Neither Parallel Nor Sequential: How DiffusionGemma Actually Commits Tokens" by Asaria, Salomone, and Gandhi shows that the block-autoregressive narrative collapses under instrumentation. The authors note that measuring decoding order requires handling trailing-EOS padding, within-regime confounding, commit non-monotonicity, block-size sensitivity, and large commit-batch ties, all artifacts that can manufacture a block-autoregressive result that is not actually present.

At token-level granularity, the model shows weak ordering; coarsen the analysis window and the ordering strengthens smoothly, indicating the "block" is a property of the ruler, not the architecture. Commits arrive in large simultaneous batches, finishing in a short late burst well inside the step budget. The pattern is regime-dependent: structured JSON is committed in essentially arbitrary order, while on mathematical reasoning tasks commit confidence tracks final correctness, yet the same confidence carries no signal for factual recall. Task accuracy matches the autoregressive Gemma-4 sibling, though Google acknowledges that overall output quality remains lower than standard Gemma 4. The throughput advantage is real but constrained. vLLM benchmarks 1,008 tokens per second on H100, roughly 5× an autoregressive baseline and 2.6× multi-token prediction, at low batch sizes. Google concedes, however, that in high-QPS cloud serving the parallel decode yields diminishing returns and can increase serving costs.
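The granularity effect can be illustrated with a small sketch. This is not the paper's instrumentation: `order_score`, `tie_fraction`, and `coarsen` are hypothetical helpers, and `steps` is synthetic toy data rather than a model trace. The score is the share of position pairs committed in left-to-right order among pairs that are not tied, and coarsening averages commit steps over fixed windows.

```python
from itertools import combinations

def order_score(steps):
    """Share of pairs (i < j) committed left-to-right, among non-tied pairs.
    Returns None when every pair is tied, i.e. the order is undefined."""
    concordant = discordant = 0
    for earlier_pos, later_pos in combinations(steps, 2):
        if earlier_pos < later_pos:
            concordant += 1
        elif earlier_pos > later_pos:
            discordant += 1
    decided = concordant + discordant
    return concordant / decided if decided else None

def tie_fraction(steps):
    """Share of position pairs that commit on the same step."""
    pairs = list(combinations(steps, 2))
    if not pairs:
        return 0.0
    return sum(a == b for a, b in pairs) / len(pairs)

def coarsen(steps, window):
    """Average commit step over consecutive windows of `window` positions."""
    return [sum(steps[i:i + window]) / len(steps[i:i + window])
            for i in range(0, len(steps), window)]

# Synthetic commit steps (assumption, not measured data): each group of four
# positions commits right-to-left, while the groups commit left-to-right.
steps = [3, 2, 1, 0, 7, 6, 5, 4]

for window in (1, 2, 4):
    print(window, order_score(coarsen(steps, window)))
```

On this toy trace the score rises as the window widens, so the same commits look weakly ordered or perfectly ordered depending only on the ruler. A batch such as `[5, 5, 5]` returns `None`, which mirrors the distinction between an undefined order and an unobserved one.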

For production inference, the measured commit behavior invalidates several standard optimization assumptions. The model's encoder mode uses causal attention and runs once to prefill the prompt, then once per block to commit each finished 256-token canvas to the KV cache, so memory-bandwidth and cache-eviction plans must account for a commit pass on every block in addition to the denoising steps rather than a single parallel decode. Speculative decoding strategies and step-budget tuners that assume deterministic 256-token boundaries will miscalibrate against the late-burst, regime-dependent commit pattern. Architects cannot use commit confidence as a runtime filter for factual accuracy, since the signal is present for math but absent for recall, and the genuinely undefined within-batch order complicates any incremental validation or streaming logic that expects even a weak left-to-right guarantee.
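For capacity planning, the causal-pass count can be estimated from the Google developer blog description quoted in Sources: one pass to prefill the prompt, then one pass per finalized 256-token canvas. The sketch below is a rough accounting aid, not Google's implementation; `causal_passes` is a hypothetical helper, and it assumes a partially filled final canvas is still committed, which the source does not confirm.

```python
import math

CANVAS_TOKENS = 256  # canvas size stated in Google's description

def causal_passes(generated_tokens, canvas_tokens=CANVAS_TOKENS):
    """Causal-attention passes for one request: one prompt prefill plus one
    commit pass per canvas.

    Assumption: a partially filled final canvas still gets a commit pass.
    """
    if generated_tokens < 0:
        raise ValueError("generated_tokens must be non-negative")
    canvases = math.ceil(generated_tokens / canvas_tokens)
    return 1 + canvases

# A 1,000-token response spans four canvases, so five causal passes.
print(causal_passes(1000))
```

These passes come on top of the denoising steps within each canvas, which this count deliberately leaves out.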

FIG. 02 DiffusionGemma throughput advantage on H200 and H100 hardware (vLLM FP8 benchmark vs. autoregressive baseline). — vLLM, 2026
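The reported multipliers imply approximate baseline throughputs, which the sketch below derives by simple division. The token rates and multipliers are the vLLM figures quoted in Sources; the baseline numbers are back-of-envelope estimates, not measurements, because the published multipliers are themselves rounded. `implied_baselines` is a hypothetical helper.

```python
# vLLM-reported figures: (tokens/sec, x autoregressive, x multi-token prediction).
REPORTED = {
    "H200": (1288, 6.0, 3.0),
    "H100": (1008, 5.0, 2.6),
}

def implied_baselines(tokens_per_sec, ar_speedup, mtp_speedup):
    """Back out the baseline throughputs implied by the reported speedups."""
    return {
        "autoregressive": tokens_per_sec / ar_speedup,
        "multi_token_prediction": tokens_per_sec / mtp_speedup,
    }

for gpu, figures in REPORTED.items():
    estimate = implied_baselines(*figures)
    print(gpu, {name: round(value) for name, value in estimate.items()})
```

On these figures the implied autoregressive baseline is roughly 200 tokens per second on either GPU, which is the reference point for judging the low-batch advantage.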

Sources

686-prompt, six-regime probe suite shows DiffusionGemma 26B decodes neither in parallel nor block-autoregressive order; partial left-to-right commit bias whose apparent strength depends on measurement granularity; within-batch order genuinely undefined
"Across a 686-prompt, six-regime probe suite we find that its decoding is neither parallel nor block-autoregressive: it follows a partial left-to-right commit bias whose apparent strength depends almost entirely on the granularity at which you look."
arxiv.org ↗
Model commits in large simultaneous batches, finishing in a short late burst well inside the step budget; commit confidence tracks math correctness but carries no signal on factual recall; JSON committed in essentially arbitrary order
"The model commits in large simultaneous batches, leaving much of the within-batch order genuinely undefined rather than merely unobserved. The behaviour is regime-dependent: structured JSON is committed in essentially arbitrary order, and a position's commit confidence tracks correctness on mathematical reasoning but carries no signal on factual recall."
arxiv.org ↗
'Block size' is an artifact of the measuring ruler rather than the architecture
"Order is weak token by token and strengthens smoothly as the analysis is coarsened, so the model's 'block size' turns out to be an artifact of the measuring ruler rather than the architecture."
arxiv.org ↗
Google describes DiffusionGemma as block-autoregressive, with the causal encoder pass running once for prompt prefill and then once per block to commit each finished 256-token canvas to the KV cache
"Prefill / Incremental Prefill (Causal): Uses causal attention to ingest the prompt context and write to the KV cache. This runs once to prefill the initial context and then once per block to append each finalized 256-token canvas to the KV cache before proceeding to denoising the next canvas."
developers.googleblog.com ↗
DiffusionGemma 26B MoE model activates 3.8B parameters, fits within 18 GB VRAM quantized, delivers 1000+ tokens/sec on H100
"Designed as a 26B Mixture of Experts (MoE) model that activates only 3.8B parameters during inference, allowing quantized deployment within 18 GB VRAM limits."
developers.googleblog.com ↗
vLLM benchmarks: FP8 model reaches 1,288 tokens/sec on H200 (~6× autoregressive baseline, ~3× multi-token prediction) and 1,008 tokens/sec on H100 (~5× autoregressive baseline, ~2.6× multi-token prediction)
"The FP8 diffusion model reaches 1,288 generation tokens per second on H200 (~6× a standard autoregressive baseline and ~3× one using multi-token prediction) and 1,008 tokens per second on H100 (~5× and ~2.6×, respectively)."
vllm.ai ↗
At high-QPS cloud serving, DiffusionGemma's parallel decoding offers diminishing returns and can result in higher serving costs
"In high-QPS cloud serving, autoregressive models can be deployed to saturate compute efficiently, so DiffusionGemma's parallel decoding offers diminishing returns and can result in higher serving costs."
blog.google ↗

Written and edited by AI agents · Methodology
