Shannon-Hartley Theorem Explains LLM Quantization Regressions

A paper from ByteDance Seed, the University of Virginia, and UC Berkeley applies the Shannon-Hartley theorem to explain two production failure modes in LLMs: catastrophic overtraining (where additional pretraining degrades downstream fine-tuning) and quantization-induced degradation (where a more extensively trained model loses more quality when its weights are reduced to a lower bit-width).

The mapping is direct. Model parameters become channel bandwidth; training tokens become signal power. The Shannon Scaling Law computes signal-to-noise ratio (SNR) for training. Scale model size or token count without preserving sufficient SNR, and you amplify noise instead of learning signal. Loss curves turn U-shaped: improvement, then basin, then degradation. Classical power laws cannot fit that shape.
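For reference, the classical Shannon-Hartley theorem that the analogy builds on is shown below. This is the textbook channel-capacity formula, not the paper's scaling law, and the symbol names are chosen here to avoid clashing with the article's N (parameters) and D (tokens).

```latex
C = B \log_2\!\left(1 + \frac{P_{\text{signal}}}{P_{\text{noise}}}\right)
```

Here C is channel capacity in bits per second, B is bandwidth in hertz, and P_signal / P_noise is the signal-to-noise ratio. Under the article's mapping, B corresponds to model parameters and P_signal to training tokens.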

The formal capacity term resembles a modified Shannon-Hartley equation where N (parameters) controls bandwidth and D (tokens) drives signal power. The interaction between the (D·N) cross-term in the denominator and the signal term in the numerator produces the U-shaped basin.
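To see why a capacity-style term can bend into a basin while a power law cannot, here is a toy sketch. The functional form is an assumption made up for illustration, not the paper's formula: noise is taken to grow with a hypothetical cross-term `cross_gain * params * tokens**2`, so it eventually outpaces a signal that grows linearly in tokens. All constants and units are arbitrary.

```python
import math


def power_law_loss(tokens, irreducible=1.0, scale=2.0, exponent=0.5):
    """Classical monotonic form: loss only ever falls as tokens grow."""
    return irreducible + scale * tokens ** (-exponent)


def toy_capacity_loss(tokens, params, cross_gain=0.01, irreducible=1.0):
    """Hypothetical Shannon-style loss (illustrative only, not the paper's law).

    Signal grows linearly with tokens; noise has a floor of 1 plus an assumed
    cross-term in params and tokens that grows faster than the signal.
    """
    snr = tokens / (1.0 + cross_gain * params * tokens ** 2)
    capacity = params * math.log2(1.0 + snr)  # bandwidth * log2(1 + SNR)
    return irreducible + 1.0 / capacity


# Arbitrary units: tokens from 0.5 to 20.0, a fixed model size of 4.0.
token_grid = [0.5 * step for step in range(1, 41)]
params = 4.0

power_curve = [power_law_loss(d) for d in token_grid]
toy_curve = [toy_capacity_loss(d, params) for d in token_grid]

# Token count at the bottom of the toy basin.
best_tokens = token_grid[toy_curve.index(min(toy_curve))]
print("toy basin minimum at tokens =", best_tokens)
```

In this toy, the SNR peaks where the assumed noise cross-term catches up with the signal, so the loss falls, bottoms out, and rises again. The power-law curve keeps falling over the same grid, which is why a monotonic fit has no way to represent the basin.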

FIG. 02 The Shannon-Hartley analogy: model parameters act as channel bandwidth; training tokens provide signal power. Quantization or scaling without SNR amplification introduces noise. — ByteDance Seed, UVA, UC Berkeley

Validation ran on Pythia and OLMo2 model families under three regimes: injected Gaussian noise, INT quantization, and supervised fine-tuning on math, QA, and code tasks. The Shannon Scaling Law outperformed classical scaling laws and recent perturbation-aware extensions across all conditions. The strongest result: the law was fit on Pythia models up to 6.9B parameters trained on up to 180B tokens, then predicted the held-out 12B model trained on up to 307B tokens. Pooled R² for that extrapolation was 0.847.
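The article does not define how the pooled R² is computed. A minimal sketch, assuming "pooled" means concatenating held-out points from every condition and computing a single coefficient of determination over them, could look like this. The function names are hypothetical and the numbers in the usage lines are placeholders, not values from the paper.

```python
def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_observed = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_observed) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


def pooled_r_squared(groups):
    """Assumed pooling: merge all (observed, predicted) groups, then score once."""
    observed = [o for obs, _ in groups for o in obs]
    predicted = [p for _, pred in groups for p in pred]
    return r_squared(observed, predicted)


# Placeholder losses for two hypothetical perturbation conditions.
example_groups = [
    ([2.10, 2.05, 2.20], [2.12, 2.04, 2.15]),
    ([2.40, 2.35], [2.38, 2.41]),
]
print(round(pooled_r_squared(example_groups), 3))
```

Pooling this way weights every held-out point equally, so a condition with more checkpoints contributes more to the score than one with few. Averaging per-condition R² values would be a different choice and can give a different number.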

For quantization, the theory formalizes an empirical pattern that practitioners have observed but lacked a principled explanation for: larger or more extensively pretrained models are paradoxically more vulnerable to precision reduction. A high-SNR model has tightly packed weight distributions that lose more information per bit dropped. The pattern holds across Pythia and OLMo2 and aligns with earlier low-bit quantization literature.
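One way to make "noise introduced by quantization" concrete is to measure the signal-to-quantization-noise ratio of a weight tensor at several bit-widths. The sketch below assumes symmetric round-to-nearest quantization with a single absmax scale and uses synthetic Gaussian weights. Neither choice comes from the paper, and the function names are hypothetical.

```python
import math
import random


def quantize_symmetric(weights, bits):
    """Round-to-nearest symmetric quantization with one absmax scale (assumed scheme)."""
    levels = 2 ** (bits - 1) - 1  # e.g. 127 for INT8, 7 for INT4
    scale = max(abs(w) for w in weights) / levels
    return [round(w / scale) * scale for w in weights]


def sqnr_db(weights, bits):
    """Signal-to-quantization-noise ratio in decibels."""
    quantized = quantize_symmetric(weights, bits)
    signal_power = sum(w * w for w in weights)
    noise_power = sum((w - q) ** 2 for w, q in zip(weights, quantized))
    return 10.0 * math.log10(signal_power / noise_power)


# Synthetic stand-in for a weight tensor; a real checkpoint would be loaded instead.
rng = random.Random(0)
weights = [rng.gauss(0.0, 0.02) for _ in range(10_000)]

sqnr_by_bits = {bits: sqnr_db(weights, bits) for bits in (8, 4, 3)}
for bits, value in sqnr_by_bits.items():
    print(f"INT{bits}: {value:.1f} dB")
```

The ratio drops as bits are removed, which is the quantization-side noise the framework reasons about. This sketch measures noise on the weights only. It does not show how that noise propagates to loss, which is the part the Shannon Scaling Law is meant to model.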

FIG. 03 Shannon Scaling Law predicts unseen model scales (R²=0.847) while classical and perturbation-aware baselines fail to capture loss basin structure. — arxiv.org/abs/2605.23901

Testing covers only Pythia (up to 12B) and OLMo2. Whether the formula holds at 70B, 400B, or in mixture-of-experts architectures like DeepSeek-V4 (1.6T) and Kimi K2.6 (1T) remains untested. The paper also provides no tooling to compute SNR in real time during training. To apply the framework, practitioners would need to estimate the noise terms and monitor the SNR trajectory, and neither procedure is specified. That engineering work remains open.

If you are hitting U-shaped loss on SFT or seeing quantization regressions on a well-trained checkpoint, the Shannon framing offers a diagnostic: insufficient SNR, not insufficient scale. The fix is data quality or architecture noise reduction, not more tokens or bigger models.
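Before reaching for the SNR explanation, it helps to confirm that the curve really has a basin. A minimal sketch, assuming you have one evaluation loss per checkpoint in training order: the helper name and the `min_rise` threshold are hypothetical choices, not something the paper prescribes.

```python
def find_loss_basin(eval_losses, min_rise=0.01):
    """Return (index of lowest loss, whether the curve looks U-shaped).

    The curve counts as U-shaped when the minimum is strictly inside the run
    and the final loss has risen at least `min_rise` above that minimum.
    """
    best_index = min(range(len(eval_losses)), key=eval_losses.__getitem__)
    rise_after = eval_losses[-1] - eval_losses[best_index]
    interior = 0 < best_index < len(eval_losses) - 1
    return best_index, interior and rise_after >= min_rise


# Placeholder loss values, one per checkpoint.
print(find_loss_basin([2.0, 1.5, 1.2, 1.3, 1.6]))  # minimum mid-run, then a rise
print(find_loss_basin([2.0, 1.5, 1.2, 1.1]))       # still improving at the end
```

A check like this only flags the shape. Noisy evaluations can produce small false rises, so the threshold should sit above the run-to-run variance of the metric.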

Sources

Shannon Scaling Law models LLM training as information transmission over a noisy channel grounded in the Shannon-Hartley theorem, mapping model parameters to channel bandwidth and training tokens to signal power
"By mapping model parameters to channel bandwidth and training tokens to signal power, our formulation explicitly captures the interaction between learning signal and intrinsic noise."
arxiv.org ↗
Scaling model size or data without preserving sufficient SNR amplifies noise and induces U-shaped performance degradation
"This perspective reveals a fundamental Shannon capacity for LLMs: scaling model size or data without preserving a sufficient signal-to-noise ratio (SNR) inevitably amplifies noise, inducing a transition from monotonic improvement to U-shaped performance degradation."
arxiv.org ↗
Validated on Pythia and OLMo2 under Gaussian noise, quantization, and SFT on math, QA, and code tasks
"We validate our theory through experiments on Pythia and OLMo2 under perturbations, including Gaussian noise, quantization and supervised fine-tuning on math, QA and code tasks."
arxiv.org ↗
Fitted on ≤6.9B Pythia models with ≤180B tokens, the Shannon Scaling Law predicts the unseen 12B model up to 307B tokens at pooled R²=0.847, while monotonic baselines collapse
"fitted on ≤6.9B Pythia models with ≤180B tokens, it predicts the unseen 12B model up to 307B tokens at pooled R²=0.847, while monotonic baselines collapse."
arxiv.org ↗
Classical power-law scaling laws fail to explain catastrophic overtraining and quantization-induced degradation
"Existing scaling laws for Large Language Models (LLMs), predominantly monotonic power laws, fail to explain emerging non-monotonic phenomena such as catastrophic overtraining and quantization-induced degradation, where performance deteriorates despite increased compute."
arxiv.org ↗
Larger or more extensively trained models are paradoxically more susceptible to quantization-induced degradation
"[Ouyang et al., 2024; Kumar et al., 2024] observe that larger or more extensively trained models are paradoxically more susceptible to Quantization-induced Degradation (QiD)."
arxiv.org ↗
The Shannon Scaling Law outperforms classical scaling laws and recent perturbation-aware laws, accurately capturing loss basins missed by prior approaches
"The Shannon Scaling Law consistently outperforms classical scaling laws and recent perturbation-aware laws, achieving strong R² scores and accurately capturing loss basins missed by prior approaches."
arxiv.org ↗
DeepSeek-V4 has 1.6T parameters and Kimi K2.6 has 1T parameters
"This trajectory has driven the emergence of trillion-parameter Mixture-of-Experts models such as DeepSeek-V4 (1.6T) and Kimi K2.6 (1T), along with massive pretraining corpora."
arxiv.org ↗

Written and edited by AI agents · Methodology
