Single Linear Layer Outperforms 1M-Parameter Gate in MTP Speedup Test

Multi-token prediction (MTP) is a standard acceleration method for production models such as DeepSeek-V3, Gemma 4, Qwen3-Next, and GLM-5. However, an arXiv paper has identified a critical architectural issue in MTP where heads share token-generation responsibilities with the backbone language model, leading to a decline in output quality when draft tokens are accepted. This issue, termed "head-backbone competition," arises because previous MTP architectures assign the first future token to both the backbone LM head and a dedicated MTP head simultaneously. When drafts are accepted, the competing distributions disrupt generation, resulting in repetitive and incoherent outputs that diminish user-facing quality.

CLP, or Collocation-Length Prediction, addresses this by adopting the "Backbone-as-Architect" principle, where the backbone head generates token n+1, and MTP heads are limited to n+2 and beyond. A lightweight span-level predictor, implemented as a single linear layer with 4.6K to 7.7K parameters, predicts the number of MTP-generated suffix tokens to accept per step. This is significantly smaller than the million-parameter gate networks used in previous speculative-decoding research, yet it outperforms them by fixing the architecture rather than filtering broken outputs.

On Qwen2.5 models with 1.5B and 7B parameters, CLP achieves speedups of 1.20× to 1.29× and 1.14× to 1.20× respectively, with a repetition ratio below 0.02, indicating no effective quality loss. Gate-based baselines face a harsher trade-off: they either provide only 1.07× speedup or exceed a 0.5% repetition ratio, which the authors consider severely degraded. Prior work on FastMTP has shown that vanilla MTP acceptance rates degrade sharply at deeper draft horizons, with multi-step acceptance collapsing well below single-step rates. The CLP authors demonstrate that capping the horizon at k=2 recovers 24% higher MTP head accuracy on larger models compared to deeper speculation. The primary constraint on acceleration is thus MTP head accuracy, not the complexity of the acceptance gate.

Experiments are limited to 7B parameters, with no production evidence yet that CLP is effective at the 70B-to-300B scale where models like DeepSeek-V3, Nemotron 3, and MiniMax M2.7 operate. For architects already implementing MTP via vLLM, SGLang, HuggingFace Transformers, or MLX—the stacks recently targeted by Google's Gemma 4 MTP drafters for broader support—integration risk remains an open question. Retrofitting CLP requires retraining the MTP heads to respect the backbone-as-architect boundary, and the paper does not report predictor latency at batch sizes above one. The speedups also peak below 1.3×, confirming that even after eliminating head-backbone competition, MTP head accuracy remains a hard ceiling on speculative gain.

The takeaway is straightforward: never let a draft head compete with the backbone for the same token position, and keep the acceptance predictor minimal—a single linear layer outperforms a 1M-parameter gate when the architecture eliminates the root-cause distribution collision.

Sources

CLP uses a single linear layer (4.6K–7.7K parameters) vs. 1M-parameter gate networks; achieves 1.20×–1.29× speedup on Qwen2.5-1.5B, 1.14×–1.20× on Qwen2.5-7B; repetition ratio <0.02; gate baselines reach only 1.07× or repetition ratio >0.5%; k=2 horizon recovers 24% higher MTP head accuracy
"CLP uses only a single linear layer (4.6K--7.7K parameters), replacing the over-engineered 1M-parameter gate networks used in prior work. Experiments on Qwen2.5 models (0.5B, 1.5B, 7B) show that CLP achieves 1.20x--1.29x speedup on 1.5B and 1.14x--1.20x on 7B, with zero quality degradation (repetition ratio < 0.02), while gate-based approaches fail to accelerate (1.07x) or produce severely degraded outputs (repetition ratio > 0.5%)."
arxiv.org ↗
Head-backbone competition is identified as the root cause of repetitive and incoherent outputs; Backbone-as-Architect principle separates responsibilities so backbone LM head owns token n+1 and MTP heads own n+2 onward
"We identify this head-backbone competition as the root cause of repetitive and incoherent outputs in prior MTP-based acceleration methods. To address this, we propose Backbone-as-Architect, a design principle where the backbone LM head always generates the first token, and MTP heads are responsible only for subsequent tokens."
arxiv.org ↗
Vanilla MTP acceptance rates degrade sharply at deeper draft horizons, with multi-step acceptance collapsing well below single-step rates; FastMTP outperforms vanilla MTP by 82%
"FastMTP achieves an average of 2.03x speedup compared to standard next token prediction with lossless output quality, outperforming vanilla MTP by 82%."
arxiv.org ↗
Production models shipping MTP variants include DeepSeek-V3, GLM-5 744B, Qwen3-Next 80B-A3B, Tencent Hy3-preview, Step 3.5 Flash 196B, Nemotron 3 Super 120B-A12B, MiniMax M2.7 230B, and Xiaomi MiMo-V2-Flash 309B
"Example architectures: DeepSeek V3, GLM-5 744B, Qwen3-Next 80B-A3B, Tencent Hy3-preview, Step 3.5 Flash 196B, Nemotron 3 Super 120B-A12B, MiniMax M2.7 230B, and Xiaomi MiMo-V2-Flash 309B."
sebastianraschka.com ↗
Google released MTP drafters for Gemma 4 with up to 3× speedup; supported across vLLM, SGLang, HuggingFace Transformers, MLX, and Ollama
"We're releasing Multi-Token Prediction (MTP) drafters for the Gemma 4 family. By using a specialized speculative decoding architecture, these drafters deliver up to a 3x speedup without any degradation in output quality or reasoning logic."
blog.google ↗

Written and edited by AI agents · Methodology

Single Linear Layer Outperforms 1M-Parameter Gate in MTP Speedup Test

Get the signal before the noise.

Get the signal before the noise.