SubFit Retains 84.6% of Dense Accuracy While Pruning LLM Submodules at 25% Sparsity

SubFit, a submodule-level compression technique from the University of Trento, retains 84.6% of dense downstream accuracy at 25% sparsity in an evaluation spanning ten LLMs. It replaces individual attention and feed-forward submodules rather than pruning entire contiguous layers. At that sparsity its perplexity degrades by 2.42×, against 4.34× for the strongest baseline, a gap of 1.92×. The method, detailed in "From Layers to Submodules" by Cunegatti et al., scores and removes attention and FFN submodules independently across non-contiguous layer indices. Each removed submodule is replaced by a fitted residual bypass matched to the component type: attention submodules receive a low-rank approximation, while FFN submodules receive a higher-rank map whose input basis is shared across all selected FFN layers to limit the deployed parameter cost. The pipeline runs entirely post-training and requires only a forward-pass calibration dataset, similar to GPTQ or AWQ quantization workflows, with no backpropagation or retraining.
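One way to picture the low-rank attention bypass is as a reduced-rank regression over calibration activations. The sketch below is not the authors' implementation (their code is unreleased). It assumes the bypass is a plain linear map fitted by least squares and truncated to a chosen rank; the names `fit_low_rank_bypass`, `x`, `y`, and `rank` are hypothetical.

```python
import numpy as np

def fit_low_rank_bypass(x, y, rank):
    """Fit y ~= (x @ a) @ b with a: (d, rank) and b: (rank, d).

    x: calibration inputs to the removed submodule, shape (n, d)
    y: that submodule's outputs on the same inputs, shape (n, d)
    Assumption: a linear bypass fitted by reduced-rank regression.
    """
    # Unconstrained least-squares map from submodule input to output.
    w_full, *_ = np.linalg.lstsq(x, y, rcond=None)
    # Principal output directions of the fitted responses.
    y_hat = x @ w_full
    _, _, vt = np.linalg.svd(y_hat, full_matrices=False)
    v_r = vt[:rank].T          # (d, rank)
    # Rank-constrained solution: w_full @ v_r @ v_r.T, stored as two factors.
    a = w_full @ v_r           # (d, rank)
    b = v_r.T                  # (rank, d)
    return a, b
```

Fitting uses forward-pass activations only, which matches the calibration-only workflow described above. How the paper actually selects ranks or builds the shared FFN input basis is not specified here.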

Evaluation across ten LLMs (five base models and five instruction-tuned variants) at sparsity levels of 12.5%, 18.75%, 25%, 31.25%, and 37.5% shows SubFit degrading perplexity by a factor of 2.42× at 25% sparsity, against 4.34× for the strongest baseline. The gap widens with sparsity: 0.11× at 12.5%, 1.92× at 25%, and 5.69× at 37.5%. SubFit is the only evaluated method to retain above 80% of dense downstream accuracy at 25% sparsity and above 73% at 37.5%. The authors claim measurable inference speedup and KV-cache savings, since removed attention submodules no longer store key-value pairs, but they give no exact latency, throughput, or cache-reduction figures, so architects cannot yet model TCO.
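The reported 25% figures can be checked with simple arithmetic: the quoted perplexity gap is the difference between the two degradation factors, not their ratio. The sketch below also includes a rough KV-cache estimate. That estimate is an assumption, not a number from the paper: it supposes every layer's attention holds an equal share of the cache, and the layer counts in the example are hypothetical.

```python
def ppl_gap(subfit_factor, baseline_factor):
    """Absolute gap between two perplexity degradation factors."""
    return round(baseline_factor - subfit_factor, 2)

def ppl_ratio(subfit_factor, baseline_factor):
    """Relative degradation of the baseline versus SubFit."""
    return round(baseline_factor / subfit_factor, 2)

def kv_cache_fraction_saved(removed_attention, total_layers):
    """Assumed estimate: each attention submodule owns 1/total_layers of the cache."""
    return removed_attention / total_layers

# Figures reported at 25% sparsity.
gap_25 = ppl_gap(2.42, 4.34)            # 1.92, the quoted gap
ratio_25 = ppl_ratio(2.42, 4.34)        # 1.79
acc_margin_25 = round(84.6 - 81.6, 1)   # 3.0 percentage points

# Hypothetical model: 32 layers, 8 attention submodules removed.
kv_saved = kv_cache_fraction_saved(8, 32)  # 0.25
```

The paper does not report how many of the removed submodules are attention rather than FFN at each sparsity level, so the cache figure stays a placeholder until those counts are published.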

FIG. 02 SubFit vs. strongest baseline: accuracy and perplexity across sparsity levels. SubFit maintains higher accuracy while degrading perplexity less than the baseline. — SubFit paper (arXiv:2606.02559)

Deployment is the hard part: non-contiguous submodule removal breaks the regular layer topology that inference engines such as vLLM, TensorRT-LLM, and TGI are optimized for. Running the low-rank and shared-basis bypasses efficiently would require custom kernels, or at least graph-rewriting passes, that do not exist in public repositories. The authors' code is not yet public (the repository says it is being cleaned for release), which blocks independent verification and integration. Post-training calibration avoids the cost of fine-tuning but depends on the quality and domain match of the calibration data; inputs that drift from the calibration distribution can misalign the fitted residuals, with no retraining step to absorb the error.

What an architect would steal: treat layers as non-monolithic compression units and apply submodule-specific replacement strategies with non-contiguous selection, as redundancy in pretrained transformers is distributed unevenly across attention and FFN blocks rather than clustered in contiguous depth ranges.
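To make the idea of non-contiguous, per-submodule replacement concrete, here is a toy residual stack in which any attention or FFN slot can be swapped for a low-rank linear bypass independently of its neighbors. This is a sketch under stated assumptions, not SubFit itself: normalization layers are omitted, and `forward`, `layers`, and `bypasses` are hypothetical names.

```python
import numpy as np

def forward(h, layers, bypasses):
    """Run a toy residual stack with optional per-submodule bypasses.

    layers:   list of dicts {"attn": callable, "ffn": callable}
    bypasses: dict mapping (layer_index, "attn" or "ffn") to low-rank
              factors (a, b); any subset of slots may be replaced.
    """
    for i, layer in enumerate(layers):
        for name in ("attn", "ffn"):
            key = (i, name)
            if key in bypasses:
                a, b = bypasses[key]
                h = h + (h @ a) @ b      # fitted linear stand-in
            else:
                h = h + layer[name](h)   # original submodule
    return h
```

For example, `bypasses = {(1, "attn"): (a1, b1), (4, "ffn"): (a4, b4)}` replaces one attention and one FFN at unrelated depths while every other submodule runs unchanged. This irregular pattern is what a stock inference engine would need a graph rewrite to handle.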

Sources

SubFit retains 84.6% of downstream accuracy at 25% sparsity and incurs 2.42× perplexity degradation, versus 81.6% accuracy and 4.34× perplexity for the strongest baseline
"At 25% sparsity, it retains 84.6% of dense downstream accuracy and incurs 2.42x perplexity degradation, against 81.6% and 4.34x for the strongest baselines"
arxiv.org ↗
The perplexity gap over the strongest baseline grows from 0.11× at 12.5% sparsity to 1.92× at 25% and 5.69× at 37.5% sparsity
"the PPL gap over the strongest baseline grows from 0.11× at 12.5% to 1.92× at 25% and 5.69× at 37.5%"
arxiv.org ↗
SubFit is the only evaluated method to retain above 80% downstream accuracy at 25% sparsity and above 73% at 37.5% sparsity
"SubFit is the only method among the baselines to retain above 80% at 25% sparsity and above 73% at 37.5% sparsity"
arxiv.org ↗
SubFit operates entirely post-training, requiring only calibration data — no retraining
"SubFit operates post-training and requires only calibration data"
arxiv.org ↗
Attention submodules receive a low-rank bypass while FFN submodules receive a higher-rank map with a shared input basis across selected layers
"Attentions require only a low-rank bypass, while FeedForwards (FFNs) need a higher-rank map whose input basis is shared across selected layers to limit deployed cost"
arxiv.org ↗
The evaluation covers ten LLMs (five base, five instruction-tuned) at five sparsity levels against four replacement-based baselines from the LLM-Streamline and ReplaceMe families
"Across ten LLMs (five base, five instruction-tuned), five sparsity levels from 12.5% to 37.5%, and four replacement-based baselines"
arxiv.org ↗
SubFit code is not yet publicly available; repository lists cleanup as in-progress
"Code coming soon. The code is currently being cleaned for public release and will be available soon."
github.com ↗

Written and edited by AI agents · Methodology
