Researchers at Ruhr University Bochum have published DepthKV, a layer-dependent KV cache pruning framework that allocates memory budgets non-uniformly across transformer layers based on each layer's measured sensitivity to pruning. The method consistently outperforms uniform pruning at the same total memory budget across multiple models and tasks.

As context windows have grown from 128K to millions of tokens, the KV cache has become the dominant memory bottleneck during long-context inference. Autoregressive decoding forces the cache to grow linearly with sequence length, quickly overwhelming GPU memory, while the prefill stage carries compute quadratic in sequence length. Modern deployments partially address this through token eviction, merging, or quantization, but virtually all prior eviction-based methods apply a single, uniform pruning ratio across every transformer layer.
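
The scale of the problem falls out of simple arithmetic. A back-of-the-envelope sizing in Python, using illustrative model dimensions rather than any figures from the paper:

```python
# Rough KV cache size for a dense transformer at long context.
# All parameters are illustrative (roughly a 70B-class model with GQA),
# not taken from the DepthKV paper.
n_layers = 80
n_kv_heads = 8          # grouped-query attention: KV heads << query heads
head_dim = 128
seq_len = 128_000       # 128K-token context
bytes_per_elem = 2      # fp16 / bf16

# Factor of 2 covers both keys and values.
kv_bytes = 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem
print(f"KV cache: {kv_bytes / 1e9:.1f} GB per sequence")  # ~41.9 GB
```

At roughly 42 GB per sequence before accounting for model weights, even small batch sizes exhaust an 80 GB accelerator, which is why eviction, merging, and quantization have become standard.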

DepthKV breaks that assumption. The researchers conducted a layer-wise ablation study, pruning one layer's cache at a time while leaving all others intact and measuring the resulting performance degradation. A permutation test consistently rejected the null hypothesis of uniform layer importance across all tested models and datasets. The finding is directional as well as statistical: layers most sensitive to pruning in ablation also produced shorter, less informative outputs during generation, confirming that the sensitivity signal tracks generation quality, not just relative layer rankings.
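
The ablation harness itself is not published; the sketch below reconstructs the procedure as described. Here `model.generate`, `prune_one_layer` (a context manager that evicts a fraction of a single layer's KV cache), and `score` (a task-quality metric) are hypothetical stand-ins, not the paper's API. The permutation test shuffles layer labels independently per example, so a small p-value rejects the null that all layers tolerate pruning equally.

```python
import numpy as np

def layer_ablation(model, prompts, n_layers, ratio, prune_one_layer, score):
    """Prune one layer's KV cache at a time, leave the rest intact, and
    record the per-prompt quality drop against the unpruned baseline."""
    base = np.array([score(model.generate(p)) for p in prompts])
    drops = np.empty((n_layers, len(prompts)))
    for layer in range(n_layers):
        with prune_one_layer(model, layer, ratio):  # hypothetical hook
            pruned = np.array([score(model.generate(p)) for p in prompts])
        drops[layer] = base - pruned
    return drops  # shape (n_layers, n_prompts); bigger drop = more sensitive

def permutation_test(drops, n_perm=10_000, seed=0):
    """Null hypothesis: layer labels are exchangeable (uniform importance).
    Statistic: variance of the per-layer mean drops."""
    rng = np.random.default_rng(seed)
    observed = drops.mean(axis=1).var()
    exceed = sum(
        rng.permuted(drops, axis=0).mean(axis=1).var() >= observed
        for _ in range(n_perm)
    )
    return (exceed + 1) / (n_perm + 1)  # small p-value => non-uniform layers
```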

With that sensitivity profile established, DepthKV reallocates the total KV budget in proportion to each layer's importance. Critical layers retain more tokens; low-sensitivity layers are pruned more aggressively. The overall memory footprint stays fixed at whatever global budget the operator sets: the method requires no larger total cache, only a smarter internal distribution of the token budget. The framework supports multiple allocation strategies and operates as a post-training, inference-time technique requiring no retraining or architectural modification.
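
The paper does not pin down a single allocation rule, so the following is one minimal proportional scheme consistent with the description: sensitivity scores from an ablation like the one above, a fixed global token budget, and a per-layer floor. The floor value and the rounding policy are our assumptions, not the paper's.

```python
import numpy as np

def allocate_budget(sensitivity, total_tokens, min_tokens=16):
    """Split a fixed global KV token budget across layers in proportion to
    measured sensitivity, never exceeding the budget uniform pruning uses."""
    sens = np.maximum(np.asarray(sensitivity, dtype=float), 0.0)
    n_layers = len(sens)
    reserved = min_tokens * n_layers  # guarantee every layer a minimum
    assert total_tokens >= reserved, "budget too small for per-layer floor"
    if sens.sum() > 0:
        weights = sens / sens.sum()
    else:
        weights = np.full(n_layers, 1.0 / n_layers)  # degenerate: uniform
    extra = np.floor(weights * (total_tokens - reserved)).astype(int)
    budgets = min_tokens + extra
    budgets[np.argmax(weights)] += total_tokens - budgets.sum()  # rounding slack
    assert budgets.sum() == total_tokens  # same footprint as uniform pruning
    return budgets
```

Uniform pruning is the degenerate case in which every layer receives total_tokens / n_layers; the reallocation changes only the internal split, never the global footprint.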

FIG. 02 Uniform pruning assigns equal KV budget to every layer; DepthKV reallocates the same total budget proportionally, protecting high-sensitivity layers and trimming low-sensitivity ones. — ai|expert diagram · arxiv 2604.24647

For enterprise teams running 100K-token-plus workloads, the value is capacity utilization, not marginal efficiency. Uniform pruning recovers memory but spreads the quality cost indiscriminately across every layer. DepthKV's sensitivity-aware allocation meets the same memory constraint while preferentially protecting the layers that matter most for generation coherence, a meaningful distinction for long-document summarization, retrieval-augmented generation, and agent orchestration pipelines that accumulate large contexts over many turns.

The no-retraining requirement lowers integration friction. Teams running existing open-weight models on private infrastructure can apply DepthKV at inference without new weights, distinguishing it from training-phase alternatives such as multi-query attention or grouped-query attention, which require model-level changes.

Several questions remain open. The paper does not specify the overhead of computing the sensitivity calibration profile, how allocation budgets should be recalibrated if context token distributions shift post-deployment, or how layer importance rankings behave across quantization levels and LoRA fine-tunes. The generalization risk is real: if production query distributions diverge significantly from the calibration workload, the budget map could protect the wrong layers. For teams running predictable, domain-specific pipelines, that risk is low; for general-purpose inference endpoints serving heterogeneous queries, careful calibration validation is necessary before committing to a fixed allocation.
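
One lightweight guard against that drift, our suggestion rather than a procedure from the paper, is to periodically re-run the sensitivity ablation on a sample of production traffic and check that the layer ranking still agrees with the calibration profile:

```python
from scipy.stats import spearmanr

def allocation_drift(calib_drops, prod_drops, threshold=0.8):
    """calib_drops / prod_drops: (n_layers, n_examples) arrays from
    layer_ablation above, measured on calibration and production samples.
    The 0.8 rank-correlation threshold is an arbitrary illustration."""
    rho, _ = spearmanr(calib_drops.mean(axis=1), prod_drops.mean(axis=1))
    return rho, bool(rho < threshold)  # (agreement, needs_recalibration)
```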

At the architecture level, the paper's core contribution is empirical: transformer layers are not interchangeable under KV budget pressure, and treating them as such leaves efficiency on the table. The degree of improvement depends on model and task; the paper claims consistent gains but does not publish aggregate speedups or memory reduction percentages in its abstract, and the full benchmark tables are in the body of the paper. Teams evaluating this technique should run the layer sensitivity ablation on their own model-workload combination rather than porting numbers from the paper's test configuration.

Written and edited by AI agents · Methodology