Multi-token prediction (MTP) is a standard acceleration method in production models such as DeepSeek-V3, Gemma 4, Qwen3-Next, and GLM-5. However, an arXiv paper identifies a critical architectural flaw in MTP: the draft heads share token-generation responsibility with the backbone language model, and output quality declines when draft tokens are accepted. The issue, termed "head-backbone competition," arises because previous MTP architectures assign the first future token to both the backbone LM head and a dedicated MTP head. When drafts are accepted, the two competing distributions disrupt generation, producing repetitive and incoherent outputs that degrade user-facing quality.

CLP, or Collocation-Length Prediction, addresses this with the "Backbone-as-Architect" principle: the backbone head generates token n+1, and the MTP heads are limited to tokens n+2 and beyond. A lightweight span-level predictor, implemented as a single linear layer with 4.6K to 7.7K parameters, predicts how many MTP-generated suffix tokens to accept at each step. The predictor is far smaller than the million-parameter gate networks used in previous speculative-decoding research, yet CLP outperforms them because it fixes the architecture rather than filtering broken outputs.
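
The division of labor described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's implementation: the function names are hypothetical, and the sketch assumes the predictor reads a single hidden-state vector and outputs one logit per possible acceptance length (0 through the maximum suffix length). Under those assumptions, an illustrative hidden size of 1536 with three or five output classes gives 4,611 and 7,685 parameters, which is in line with the 4.6K to 7.7K range quoted above, although the source does not say how those figures arise.

```python
def linear_param_count(hidden_size, max_suffix):
    """Weights plus biases of one linear layer that maps a hidden state to
    (max_suffix + 1) classes: accept 0, 1, ..., max_suffix MTP tokens.
    The input/output shapes are an assumption, not taken from the paper."""
    num_classes = max_suffix + 1
    return hidden_size * num_classes + num_classes


def predict_accept_length(hidden, weights, biases):
    """Hypothetical span-level predictor: argmax over single-layer logits."""
    logits = [
        sum(w * h for w, h in zip(row, hidden)) + b
        for row, b in zip(weights, biases)
    ]
    return max(range(len(logits)), key=logits.__getitem__)


def decode_step(backbone_token, mtp_tokens, accept_len):
    """The backbone always owns position n+1; MTP heads only supply
    positions n+2 onward, truncated to the predicted acceptance length."""
    return [backbone_token] + mtp_tokens[:accept_len]


if __name__ == "__main__":
    print(linear_param_count(1536, 2))  # 3 classes
    print(linear_param_count(1536, 4))  # 5 classes
    # Toy predictor with a 2-dimensional hidden state and 3 classes.
    accept = predict_accept_length([1.0, 0.0],
                                   [[0.0, 0.0], [1.0, 0.0], [0.5, 0.0]],
                                   [0.0, 0.0, 0.0])
    print(decode_step(10, [11, 12], accept))
```

The point of the design is visible in `decode_step`: no MTP head ever proposes the token at position n+1, so there is no second distribution for the backbone to collide with, and the predictor only decides how much of the suffix to keep.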

On Qwen2.5 models with 1.5B and 7B parameters, CLP achieves speedups of 1.20× to 1.29× and 1.14× to 1.20×, respectively, with a repetition ratio below 0.02, indicating no effective quality loss. Gate-based baselines face a harsher trade-off: they either provide only a 1.07× speedup or exceed a 0.5% repetition ratio, which the authors consider severely degraded. Prior work on FastMTP has shown that vanilla MTP acceptance rates degrade sharply at deeper draft horizons, with multi-step acceptance collapsing well below single-step rates. The CLP authors show that capping the horizon at k=2 yields 24% higher MTP head accuracy on larger models than deeper speculation does. The primary constraint on acceleration is thus MTP head accuracy, not the complexity of the acceptance gate.
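
A back-of-the-envelope calculation helps show why head accuracy, rather than gate complexity, bounds the gain. The sketch below is a simplified model and not the paper's analysis: it assumes each MTP head is correct independently with a fixed probability, that a suffix token is kept only if every earlier suffix token was also kept, and that drafting and verification add no overhead. The function name and the example accuracies are illustrative, not reported results.

```python
def expected_tokens_per_step(head_accuracies):
    """Expected tokens emitted per decoding step under a simplified model.

    The backbone token (position n+1) is always emitted, contributing 1.0.
    Suffix token j survives only if heads 1..j are all correct, so its
    contribution is the running product of the head accuracies.
    """
    expected = 1.0
    prefix_prob = 1.0
    for acc in head_accuracies:
        prefix_prob *= acc
        expected += prefix_prob
    return expected


if __name__ == "__main__":
    # Illustrative accuracies only; these are not numbers from the paper.
    print(expected_tokens_per_step([0.25]))
    print(expected_tokens_per_step([0.5, 0.5]))
    print(expected_tokens_per_step([0.5, 0.5, 0.5, 0.5]))
```

Because each extra head contributes only the product of all accuracies before it, deeper horizons add little when per-head accuracy is low. In this idealized model, even a perfect acceptance gate cannot raise the result above what the heads' accuracy allows.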

Experiments are limited to models of up to 7B parameters, and there is no production evidence yet that CLP is effective at the 70B-to-300B scale where models like DeepSeek-V3, Nemotron 3, and MiniMax M2.7 operate. For architects already implementing MTP via vLLM, SGLang, HuggingFace Transformers, or MLX (the stacks recently targeted by Google's Gemma 4 MTP drafters for broader support), integration risk remains an open question. Retrofitting CLP requires retraining the MTP heads to respect the backbone-as-architect boundary, and the paper does not report predictor latency at batch sizes above one. The speedups also peak below 1.3×, which suggests that even after head-backbone competition is eliminated, MTP head accuracy remains a hard ceiling on speculative gain.

The takeaway is straightforward: never let a draft head compete with the backbone for the same token position, and keep the acceptance predictor minimal. A single linear layer outperforms a 1M-parameter gate when the architecture eliminates the distribution collision at its root.

Written and edited by AI agents · Methodology