RESEARCH BY AI|EXPERT SCOUT · Wednesday, May 6, 2026 · 3 MIN READ
SpecKV Boosts Speculative Decoding Efficiency by 56%
Researcher Shikhar Shukla proposes SpecKV, an adaptive speculative decoding approach that optimizes the speculation length (gamma) for the model's compression level, replacing fixed hyperparameter choices and promising measured latency improvements for inference-heavy applications such as real-time APIs.
FIG. 01 · Adaptive control unlocks hidden efficiency in speculative decoding. (Generative imagery)
SpecKV delivers a 56.0% improvement in tokens per speculation step over the standard fixed-gamma speculative decoding baseline, with less than 0.5% added latency overhead. Speculative decoding is now standard in production stacks like vLLM and OpenAI's serving infrastructure, using a small draft model to propose candidate tokens that the larger target model verifies in parallel. The key tunable is gamma (γ): how many tokens the draft model proposes per step. Most deployed systems hard-code γ=4, a default that researcher Shikhar Shukla argues is systematically suboptimal — particularly as deployments apply quantization to cut memory and compute costs.
FIG. 02 · SpecKV improves tokens per speculation step by 56% over fixed γ=4, with negligible overhead.
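To see why γ is worth tuning at all, the standard speculative-decoding arithmetic is instructive: the expected number of tokens a step emits depends on the draft model's per-token acceptance rate, and the best trade-off against draft cost moves as that rate moves. The sketch below is illustrative only; the constant acceptance rate alpha and the draft-to-target cost ratio c are simplifying assumptions, not SpecKV's objective function.

```python
# Textbook speculative-decoding estimate, assuming every drafted token is
# accepted independently with probability alpha and each draft forward pass
# costs a fraction c of a target verification pass.
def expected_tokens(alpha: float, gamma: int) -> float:
    """Expected tokens emitted per step: gamma drafts plus one bonus token."""
    if alpha >= 1.0:
        return gamma + 1.0
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)

def step_cost(gamma: int, c: float) -> float:
    """Relative wall-clock cost of one step: gamma draft passes + 1 verification."""
    return gamma * c + 1.0

# The gamma that maximizes tokens per unit cost shifts with the acceptance rate.
for alpha in (0.5, 0.7, 0.9):
    best = max(range(1, 9), key=lambda g: expected_tokens(alpha, g) / step_cost(g, 0.1))
    print(f"acceptance rate {alpha}: best gamma = {best}")
```

Higher acceptance rates favor longer speculation runs and lower ones favor shorter runs, which is exactly the gap a per-step controller is meant to close.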
SpecKV, published May 4, 2026 on arXiv, replaces fixed γ with a per-step adaptive controller. The system profiles speculative decoding across four task categories, four speculation lengths, and three compression levels: FP16, INT8, and NF4. It accumulates 5,112 step-level records capturing per-step acceptance rates, draft entropy, and draft confidence. Optimal γ shifts meaningfully across compression regimes. The draft model's confidence and entropy scores predict acceptance rate with a correlation of 0.56. A small MLP trained on these signals selects γ dynamically, maximizing expected tokens per speculation step.
FIG. 03 · SpecKV uses a trained MLP to dynamically adjust speculation depth (γ) based on draft model signals, replacing the fixed γ=4 approach.
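The released artifacts define the actual controller; as a rough, hypothetical sketch of the shape of such a system, the fragment below trains a tiny scikit-learn MLP on synthetic stand-in profiling records to predict acceptance rate from draft entropy, confidence, and compression level, then scores candidate γ values online. The cost-weighted objective is an added assumption (dividing by an assumed step cost keeps the choice non-trivial), and none of the feature names or architecture choices come from the paper.

```python
# Hypothetical per-step gamma controller in the spirit of SpecKV. The features,
# MLP size, and cost-weighted objective are assumptions for illustration.
import numpy as np
from sklearn.neural_network import MLPRegressor

# Offline: fit (entropy, confidence, compression_id) -> acceptance rate on
# profiling records. Synthetic stand-ins here; the paper ships 5,112 real ones.
rng = np.random.default_rng(0)
X_profile = rng.random((5112, 3))
y_accept = np.clip(0.4 + 0.4 * X_profile[:, 1] - 0.2 * X_profile[:, 0], 0.0, 1.0)
mlp = MLPRegressor(hidden_layer_sizes=(32, 16), max_iter=500, random_state=0)
mlp.fit(X_profile, y_accept)

DRAFT_COST = 0.1  # assumed draft-to-target cost ratio per drafted token

def expected_tokens(alpha: float, gamma: int) -> float:
    return gamma + 1.0 if alpha >= 1.0 else (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)

def choose_gamma(entropy: float, confidence: float, compression_id: float,
                 candidates=(2, 4, 6, 8)) -> int:
    """Online: predict acceptance from live draft signals, then pick gamma."""
    alpha_hat = float(np.clip(mlp.predict([[entropy, confidence, compression_id]])[0], 0.0, 0.99))
    return max(candidates, key=lambda g: expected_tokens(alpha_hat, g) / (g * DRAFT_COST + 1.0))

print(choose_gamma(entropy=0.3, confidence=0.9, compression_id=1.0))
```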
For enterprise AI infrastructure teams, the implication is direct. Quantization is the primary lever for fitting large models into GPU budgets — INT8 and NF4 are common targets. SpecKV's finding that compression level changes the optimal speculation length means a static γ=4 leaves tokens on the table whenever a quantized model is in the serving path. The controller adds 0.34 ms per decision — less than 0.5% of total step time — making it a low-risk drop-in for existing pipelines.
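What "drop-in" means in practice: in a generic speculative-decoding loop, swapping a fixed γ=4 for an adaptive choice changes a single line per step. The names below are hypothetical stand-ins, not vLLM or SpecKV APIs.

```python
# Generic speculative-decoding loop with a pluggable gamma controller.
# controller / draft_propose / target_verify are toy stand-ins, not real APIs.
def speculative_generate(prompt_ids, controller, draft_propose, target_verify,
                         max_new_tokens=256):
    out = list(prompt_ids)
    while len(out) - len(prompt_ids) < max_new_tokens:
        gamma = controller(out)                # fixed-gamma systems hard-code 4 here
        drafts = draft_propose(out, gamma)     # gamma candidate tokens from the draft model
        accepted = target_verify(out, drafts)  # target verifies all candidates in parallel
        if not accepted:                       # guard for toy stubs; real loops always emit a token
            break
        out.extend(accepted)
    return out

# Toy stubs so the loop runs end to end.
demo = speculative_generate(
    prompt_ids=[1, 2, 3],
    controller=lambda ctx: 4,
    draft_propose=lambda ctx, g: list(range(g)),
    target_verify=lambda ctx, drafts: drafts[: max(1, len(drafts) // 2)],
    max_new_tokens=16,
)
print(len(demo) - 3, "new tokens")
```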
The 56.0% improvement is statistically robust, validated with a paired bootstrap test at p < 0.001. The work is a single-author arXiv preprint that has not yet undergone peer review. The profiling dataset spans four task types; teams running code-generation or retrieval-augmented workloads should validate on their own trace data before assuming the same gains. Shukla releases all profiling data, trained MLP models, and notebooks as open-source artifacts, lowering integration costs.
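For teams doing that validation on their own traces, the reported check is straightforward to reproduce: a paired bootstrap over per-step token counts from matched adaptive and fixed-γ runs. The arrays below are synthetic stand-ins that roughly mirror the reported magnitudes; substitute real paired measurements.

```python
# Paired bootstrap on per-step token counts (adaptive vs. fixed gamma).
# Synthetic stand-in data; replace with paired measurements from your traces.
import numpy as np

rng = np.random.default_rng(0)
tokens_fixed = rng.normal(3.0, 0.8, size=5112)                     # gamma = 4 baseline
tokens_adaptive = tokens_fixed + rng.normal(1.7, 0.9, size=5112)   # same prompts, adaptive gamma

diff = tokens_adaptive - tokens_fixed
boot_means = np.array([rng.choice(diff, size=diff.size, replace=True).mean()
                       for _ in range(10_000)])
p_value = (boot_means <= 0).mean()   # one-sided: chance the mean gain is <= 0
print(f"mean gain = {diff.mean():.2f} tokens/step, bootstrap p = {p_value:.4g}")
```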