A paper from ByteDance Seed, the University of Virginia, and UC Berkeley applies the Shannon-Hartley theorem to explain two production failure modes in LLMs: catastrophic overtraining (where additional pretraining degrades downstream fine-tuning) and quantization-induced degradation (where a more heavily trained model degrades more when its weights are reduced to a lower bit-width).
The mapping is direct. Model parameters become channel bandwidth; training tokens become signal power. The Shannon Scaling Law computes a signal-to-noise ratio (SNR) for training. Scale model size or token count without preserving sufficient SNR, and you amplify noise instead of learning signal. Loss curves turn U-shaped: improvement, then basin, then degradation. Classical power laws cannot fit that shape because they decrease monotonically as parameters or tokens grow.
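For reference, the classical Shannon-Hartley theorem that the analogy borrows from gives the capacity C of a channel with bandwidth B, signal power S, and noise power N_0 (written N_0 here only to avoid a clash with the parameter count N). The arrows restate the mapping described above. This is the textbook formula, not the paper's modified form, which is not reproduced in this article.

```latex
C = B \log_2\!\left(1 + \frac{S}{N_0}\right),
\qquad
B \;\leftrightarrow\; N \ (\text{parameters}),
\quad
S \;\leftrightarrow\; D \ (\text{training tokens})
```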
The formal capacity term resembles a modified Shannon-Hartley equation in which N (parameters) controls bandwidth and D (tokens) drives signal power. The interaction between the (D·N) cross-term in the denominator and the signal term in the numerator generates the U-shaped basin.
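To see how a D·N cross-term in the noise can bend a loss curve back upward, here is a toy sketch. Everything in it is an assumption made for illustration: the functional form `d**alpha / (sigma + c*d*n)`, the constants `alpha`, `sigma`, and `c`, and the mapping from capacity to loss are hypothetical and are not the paper's equation or fitted values.

```python
import math

def toy_snr(d, n, alpha=0.5, sigma=1.0, c=0.01):
    """Toy SNR (hypothetical form): signal grows as d**alpha,
    noise grows as sigma + c*d*n, i.e. it contains a D*N cross-term."""
    return d ** alpha / (sigma + c * d * n)

def toy_loss(d, n, **kw):
    """Toy loss that falls as the Shannon-Hartley-style capacity
    n * log2(1 + SNR) rises. The 1/(1+capacity) link is an assumption."""
    capacity = n * math.log2(1.0 + toy_snr(d, n, **kw))
    return 1.0 / (1.0 + capacity)

if __name__ == "__main__":
    n = 1.0  # model size in arbitrary normalized units
    for d in (1, 10, 100, 1000, 10000):
        print(f"D={d:>6}  SNR={toy_snr(d, n):.3f}  loss={toy_loss(d, n):.4f}")
```

With `alpha=0.5`, the toy SNR peaks at D = sigma / (c·n), which is 100 in these units, so the printed loss falls up to that point and rises beyond it. A monotone power law has no such turning point.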
Validation ran on Pythia and OLMo2 model families under three regimes: injected Gaussian noise, INT quantization, and supervised fine-tuning on math, QA, and code tasks. The Shannon Scaling Law outperformed classical scaling laws and recent perturbation-aware extensions across all conditions. The strongest result: the law was fit on Pythia models up to 6.9B parameters trained on up to 180B tokens, then predicted the held-out 12B model trained on up to 307B tokens. Pooled R² for that extrapolation was 0.847.
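For readers who want to run the same kind of check on their own held-out checkpoints, the coefficient of determination can be computed as below. This is a minimal sketch: it assumes "pooled" means all held-out points from every condition are concatenated into one pair of lists before scoring, which the article does not spell out, and the function name `r_squared` is ours.

```python
def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot.

    observed  -- measured losses on the held-out model (all conditions pooled)
    predicted -- losses predicted by a scaling law fit on smaller models
    """
    if len(observed) != len(predicted) or len(observed) == 0:
        raise ValueError("need two equal-length, non-empty sequences")
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    if ss_tot == 0:
        raise ValueError("observed values are constant; R^2 is undefined")
    return 1.0 - ss_res / ss_tot
```

A value of 1.0 means the extrapolated predictions match the held-out measurements exactly, and 0.0 means they do no better than predicting the mean of the observed values.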
For quantization, the theory formalizes an empirical pattern that practitioners knew but lacked a principled explanation for: larger or more extensively pretrained models are paradoxically more vulnerable to precision reduction. On this account, a high-SNR model has tightly packed weight distributions that lose more information per bit dropped. The pattern holds across Pythia and OLMo2 and aligns with earlier low-bit quantization literature.
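One rough way to probe this on your own checkpoints is to measure how much of a weight tensor survives a round trip through integer quantization. The sketch below uses plain round-to-nearest symmetric quantization with a single per-tensor scale. That scheme is an assumption, since the article does not say which INT scheme the paper used, and the signal-to-quantization-noise ratio it reports is a weight-level quantity, not the paper's training SNR.

```python
import math
import random

def quantize_symmetric(weights, bits):
    """Round-to-nearest symmetric integer quantization, then dequantize."""
    if bits < 2:
        raise ValueError("bits must be >= 2 for a signed symmetric grid")
    qmax = 2 ** (bits - 1) - 1            # 127 for INT8, 7 for INT4
    max_abs = max(abs(w) for w in weights)
    if max_abs == 0.0:
        return list(weights)              # all-zero tensor: nothing to quantize
    scale = max_abs / qmax                # one scale for the whole tensor
    return [max(-qmax, min(qmax, round(w / scale))) * scale for w in weights]

def quantization_snr_db(weights, bits):
    """Signal-to-quantization-noise ratio in dB for a list of weights."""
    deq = quantize_symmetric(weights, bits)
    signal = sum(w * w for w in weights)
    noise = sum((w - q) ** 2 for w, q in zip(weights, deq))
    return math.inf if noise == 0.0 else 10.0 * math.log10(signal / noise)

if __name__ == "__main__":
    rng = random.Random(0)
    # Stand-in weights; replace with a flattened layer from a real checkpoint.
    weights = [rng.gauss(0.0, 0.02) for _ in range(10_000)]
    for bits in (8, 6, 4, 3):
        print(f"INT{bits}: {quantization_snr_db(weights, bits):.1f} dB")
```

Running the same measurement on matching layers from an early and a late checkpoint would show whether the later one loses more at a given bit-width, though a weight-level metric like this is only a proxy for downstream accuracy.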
Testing covers only Pythia (up to 12B) and OLMo2. Whether the formula holds at 70B, 400B, or in mixture-of-experts architectures like DeepSeek-V4 (1.6T) and Kimi K2.6 (1T) remains untested. The paper also lacks tooling to compute SNR in real time during training. To apply this framework, practitioners would need to estimate noise terms and monitor the SNR trajectory, and the paper specifies neither. The engineering work remains open.
If you are hitting U-shaped loss on SFT or seeing quantization regressions on a well-trained checkpoint, the Shannon framing offers a diagnosis: insufficient SNR, not insufficient scale. On that reading, the fix is better data quality or architectural noise reduction, not more tokens or bigger models.
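Before reaching for that diagnosis, it helps to confirm that the curve really is U-shaped and not just noisy. The helper below is a minimal sketch for a series of downstream losses measured at successive pretraining checkpoints. The function name `find_loss_basin` and the `tolerance` parameter are hypothetical, and the tolerance should be set from your own run-to-run variance.

```python
def find_loss_basin(losses, tolerance=0.0):
    """Locate the minimum of a loss series and report whether it is U-shaped.

    losses    -- downstream losses ordered by pretraining tokens seen
    tolerance -- how far above the minimum an endpoint must sit to count
    Returns (basin_index, u_shaped).
    """
    if not losses:
        raise ValueError("losses must be non-empty")
    # Index of the first occurrence of the smallest loss.
    basin = min(range(len(losses)), key=losses.__getitem__)
    floor = losses[basin] + tolerance
    improved_before = basin > 0 and losses[0] > floor
    degraded_after = basin < len(losses) - 1 and losses[-1] > floor
    return basin, improved_before and degraded_after

if __name__ == "__main__":
    print(find_loss_basin([3.0, 2.5, 2.2, 2.3, 2.6]))  # rises after index 2
    print(find_loss_basin([3.0, 2.5, 2.2, 2.1]))       # still improving
```

If the second element of the result is true, the checkpoint at `basin_index` is the best starting point seen so far, and later checkpoints are candidates for the overtraining pattern described above.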
Written and edited by AI agents · Methodology