Instrumentation of the DiffusionGemma 26B model shows that its token-commit order is neither parallel nor block-autoregressive, which challenges latency and cost models built on parallel-decoding assumptions. Using a suite of 686 prompts across six regimes, researchers recorded which canvas positions commit, when, and at what confidence. They found a partial left-to-right bias whose apparent strength depends on measurement granularity, and an order within each commit batch that is undefined rather than merely unobserved.
DiffusionGemma 26B is a masked discrete-diffusion mixture-of-experts model based on Gemma 4. It has 26 billion total parameters, of which 3.8 billion are active per forward pass, and it fits within 18 GB of VRAM when quantized. Google's model card and developer blog describe it as block-autoregressive: it denoises a 256-token canvas in parallel, commits the block to the KV cache, and advances to the next block, at 1,008 tokens per second on an H100 in FP8, or 1,288 tokens per second on an H200 according to vLLM-published benchmarks. The arXiv paper "Neither Parallel Nor Sequential: How DiffusionGemma Actually Commits Tokens" by Asaria, Salomone, and Gandhi argues that this narrative collapses under instrumentation. The authors note that measuring decoding order requires handling trailing-EOS padding, within-regime confounding, commit non-monotonicity, block-size sensitivity, and large commit-batch ties. Left uncorrected, these artifacts can manufacture a block-autoregressive result that is not actually present.
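Two of the artifacts listed above, trailing-EOS padding and commit non-monotonicity, can be illustrated with a minimal sketch. It is not the paper's instrumentation: the trace layout (one list of token ids per denoising step), the `MASK` and `EOS` ids, and the function name `commit_steps` are all assumptions made for illustration.

```python
from typing import List, Optional

MASK = -1  # hypothetical id for a canvas slot that is still masked
EOS = 0    # hypothetical end-of-sequence / padding id


def commit_steps(trace: List[List[int]]) -> List[Optional[int]]:
    """Return, per canvas position, the step at which its final token settled.

    trace[t][i] is the token id at position i after denoising step t.
    A position that is unmasked, re-masked, and unmasked again is credited
    with the start of its last stable run, not with its first unmask.
    Trailing EOS padding and never-committed slots are reported as None so
    they cannot be counted as content that committed in order.
    """
    final = trace[-1]
    n = len(final)

    # Locate the end of real content by stripping the EOS run at the tail.
    content_end = n
    while content_end > 0 and final[content_end - 1] == EOS:
        content_end -= 1

    steps: List[Optional[int]] = []
    for i in range(n):
        if i >= content_end or final[i] == MASK:
            steps.append(None)
            continue
        # Walk backwards while the position already holds its final token.
        t = len(trace) - 1
        while t > 0 and trace[t - 1][i] == final[i]:
            t -= 1
        steps.append(t)
    return steps


# Toy trace (made-up data): position 2 is unmasked at step 1, re-masked at
# step 2, and only settles at step 3; position 3 is trailing padding.
toy_trace = [
    [MASK, MASK, MASK, MASK],
    [5, MASK, 7, MASK],
    [5, MASK, MASK, EOS],
    [5, 6, 7, EOS],
]
print(commit_steps(toy_trace))  # [1, 3, 3, None]
```

Crediting position 2 with step 1 instead of step 3 would make the toy trace look more ordered than it is, which is the kind of distortion the authors warn about.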
At token-level granularity, the model shows only weak ordering. Coarsen the analysis window and the apparent block size grows smoothly, which indicates that the "block" is a property of the ruler, not the architecture. Commits arrive in large simultaneous batches and finish in an aggressive late burst well inside the step budget. The pattern is regime-dependent: structured JSON is committed in arbitrary order, while on mathematical reasoning tasks commit confidence tracks final correctness, yet the same confidence carries no signal for factual recall. Task accuracy matches the autoregressive Gemma-4 sibling, though Google acknowledges that overall output quality remains lower than standard Gemma 4. The throughput advantage is real but constrained. At low batch sizes, vLLM benchmarks 1,008 tokens per second on an H100, roughly 5× an autoregressive baseline and 2.6× multi-token prediction. Google concedes, however, that at high-QPS cloud serving the parallel decode yields diminishing returns and can increase serving costs.
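The granularity effect can be reproduced on toy data with a simple rank statistic. The sketch below is an assumption-laden illustration, not the authors' metric: it averages commit steps inside windows of a chosen size, then computes a Goodman-Kruskal gamma-style score over window pairs, skipping ties because positions committed in the same batch have no defined order. The function name `order_score` and the input format are hypothetical.

```python
from typing import List, Optional


def order_score(steps: List[Optional[int]], window: int = 1) -> Optional[float]:
    """Left-to-right ordering score in [-1, 1] at a given window size.

    steps[i] is the commit step of position i, or None if the position is
    excluded (for example trailing padding). Pairs of windows with equal
    mean commit step are ties: their order is undefined, so they are
    skipped rather than counted as ordered or disordered.
    """
    # Average commit step per window, ignoring excluded positions.
    means: List[float] = []
    for start in range(0, len(steps), window):
        chunk = [s for s in steps[start:start + window] if s is not None]
        if chunk:
            means.append(sum(chunk) / len(chunk))

    concordant = discordant = 0
    for a in range(len(means)):
        for b in range(a + 1, len(means)):
            if means[a] < means[b]:
                concordant += 1   # earlier window committed first
            elif means[a] > means[b]:
                discordant += 1   # later window committed first
    comparable = concordant + discordant
    if comparable == 0:
        return None  # every pair tied: no order to measure
    return (concordant - discordant) / comparable


# Made-up commit steps: neighbours are swapped, but pairs drift rightward.
toy_steps = [2, 1, 4, 3, 6, 5]
print(order_score(toy_steps, window=1))  # 0.6
print(order_score(toy_steps, window=2))  # 1.0
print(order_score([3, 3, 3], window=1))  # None
```

On this toy input the score climbs from 0.6 to 1.0 simply because the window doubled, so a coarse enough ruler reports perfect block order even though the token-level order is mixed. A single batch in which every position commits at the same step yields `None`, reflecting that the order is undefined, not merely unmeasured.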
For production inference, the measured commit behavior invalidates several standard optimization assumptions. The model's encoder mode uses causal attention and runs twice per block, once for prompt prefill and once to commit the finished block to the KV cache, so memory-bandwidth and cache-eviction plans must account for dual passes rather than a single parallel decode. Speculative decoding strategies and step-budget tuners that assume deterministic 256-token boundaries will miscalibrate against the late-burst, regime-dependent commit pattern. Architects cannot use commit confidence as a runtime filter for factual accuracy, since the signal is present for math but absent for recall, and the genuinely undefined within-batch order complicates any incremental validation or streaming logic that expects even a weak left-to-right guarantee.
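One way to act on the confidence caveat is to test, per regime, whether commit confidence separates correct from incorrect outputs before using it as a gate. The following is a minimal sketch under stated assumptions: the record format `(regime, confidence, correct)`, the helper names, the 0.7 AUROC threshold, and the sample values are all hypothetical and do not come from the paper or from any DiffusionGemma API.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Optional, Tuple


def auroc(scores: List[float], labels: List[bool]) -> Optional[float]:
    """Probability that a correct output outscores an incorrect one.

    Computed by direct pair counting; ties count as half a win.
    Returns None when one of the two classes is empty.
    """
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        return None
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def regimes_with_signal(
    records: Iterable[Tuple[str, float, bool]], min_auroc: float = 0.7
) -> Dict[str, bool]:
    """Map each regime to whether confidence looks usable as a filter there."""
    scores: Dict[str, List[float]] = defaultdict(list)
    labels: Dict[str, List[bool]] = defaultdict(list)
    for regime, confidence, correct in records:
        scores[regime].append(confidence)
        labels[regime].append(correct)
    result: Dict[str, bool] = {}
    for regime in scores:
        value = auroc(scores[regime], labels[regime])
        # Assumed threshold: treat anything below min_auroc as "no signal".
        result[regime] = value is not None and value >= min_auroc
    return result


# Made-up evaluation records, chosen only to exercise both outcomes.
toy_records = [
    ("math", 0.9, True), ("math", 0.8, True),
    ("math", 0.3, False), ("math", 0.2, False),
    ("recall", 0.9, True), ("recall", 0.2, True),
    ("recall", 0.8, False), ("recall", 0.3, False),
]
print(regimes_with_signal(toy_records))  # {'math': True, 'recall': False}
```

A gate built this way would be enabled only for regimes that pass the check on held-out data, and it would need re-validation whenever the prompt mix changes, since the signal described above does not transfer from math to recall.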
Written and edited by AI agents.