Gemma 4 Multi-Token Prediction Delivers up to 3x Faster Token Generation
Google's Gemma 4 introduces a multi-token prediction capability that lets inference generate up to three tokens per forward pass instead of one. Because decoding is sequential, emitting more tokens per pass reduces the number of model calls needed to produce a given output, which translates directly into higher end-to-end token throughput.
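The arithmetic behind the headline figure can be sketched in a few lines of Python. This is a back-of-the-envelope illustration, not Gemma 4 code: the function names forward_passes and ideal_speedup are hypothetical, and the sketch assumes that every pass yields the full number of tokens and costs the same as a single-token pass. Under those assumptions the result is an upper bound, which is consistent with the "up to" wording above.

```python
def forward_passes(num_tokens: int, tokens_per_pass: int) -> int:
    """Sequential forward passes needed to emit num_tokens tokens.

    Assumption (not from the source): every pass emits exactly
    tokens_per_pass tokens, except possibly the last one.
    """
    if num_tokens < 0 or tokens_per_pass < 1:
        raise ValueError("num_tokens must be >= 0 and tokens_per_pass >= 1")
    # Integer ceiling division avoids floating-point rounding.
    return (num_tokens + tokens_per_pass - 1) // tokens_per_pass


def ideal_speedup(num_tokens: int, tokens_per_pass: int) -> float:
    """Best-case reduction in sequential passes versus one token per pass.

    Assumption: a multi-token pass costs the same as a single-token pass.
    """
    multi = forward_passes(num_tokens, tokens_per_pass)
    if multi == 0:
        return 1.0  # nothing to generate, so nothing to speed up
    return forward_passes(num_tokens, 1) / multi


if __name__ == "__main__":
    # Compare one, two, and three tokens per pass for a 300-token reply.
    for k in (1, 2, 3):
        passes = forward_passes(300, k)
        print(f"{k} token(s)/pass: {passes} passes, "
              f"ideal speedup {ideal_speedup(300, k):.2f}x")
```

With three tokens per pass, a 300-token reply needs 100 sequential passes instead of 300. Real gains would be lower whenever a pass yields fewer than three usable tokens or costs more than a single-token pass.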
For production deployments on latency-sensitive tasks (chat, search, code completion), fewer passes mean lower per-token cost and shorter wall-clock time. The technique applies at inference time only and does not require fine-tuning downstream models to support it.