Shannon-Hartley theorem unifies LLM scaling, overfitting, and quantization collapse
A new arXiv preprint models neural network capacity with the Shannon-Hartley theorem, explaining monotonic scaling, catastrophic overfitting, and quantization collapse within a single information-theoretic framework.

A team led by Xu Ouyang has published a preprint that reframes large language model scaling through the lens of classical information theory. The paper, "LLMs as Noisy Channels: A Shannon Perspective on Model Capacity and Scaling Laws," treats the training process as signal transmission over a noisy channel—parameters become bandwidth, training tokens become signal power, and the Shannon-Hartley capacity theorem sets hard limits on what any model can learn.
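For readers who want the underlying formula, the classical Shannon-Hartley theorem states that channel capacity is C = B · log2(1 + S/N), where B is bandwidth, S is signal power, and N is noise power. The sketch below implements only that textbook formula. The function and argument names are illustrative, and the way the paper converts parameter and token counts into B, S, and N is not given in this summary, so no such mapping is attempted here.

```python
import math


def shannon_hartley_capacity(bandwidth: float, signal_power: float, noise_power: float) -> float:
    """Classical Shannon-Hartley capacity: C = B * log2(1 + S/N).

    In the article's analogy, `bandwidth` plays the role of parameters and
    `signal_power` the role of training tokens. The paper's exact mapping
    is NOT reproduced here; this is the textbook formula only.
    """
    if bandwidth < 0 or signal_power < 0:
        raise ValueError("bandwidth and signal_power must be non-negative")
    if noise_power <= 0:
        raise ValueError("noise_power must be positive")
    return bandwidth * math.log2(1.0 + signal_power / noise_power)


if __name__ == "__main__":
    # With a fixed bandwidth and signal, raising the noise lowers the capacity.
    for noise in (1.0, 2.0, 4.0):
        print(noise, shannon_hartley_capacity(2.0, 3.0, noise))
```

The one property the analogy relies on is visible directly: with bandwidth and signal held fixed, capacity falls as noise rises.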
The resulting Shannon Scaling Law unifies two behaviors that standard power-law scaling cannot explain: the U-shaped loss curve that appears when a model is overtrained on finite data, and the capacity collapse that follows aggressive quantization. Both are emergent consequences of cumulative noise (data noise, inter-component interference, and architectural constraints) that the new framework models explicitly. The authors report predictive accuracy across multiple model families and training regimes, including scenarios where classical Chinchilla-style laws break down.
What stands out
- 01. Hard capacity ceiling. The Shannon bound gives a finite information ceiling for any parameter-token budget. Beyond that ceiling, additional pretraining or lower-bit quantization destroys capacity rather than preserving it. Practitioners gain a mathematical stopping rule instead of guessing when overfitting begins.
- 02. Unified view of non-monotonic loss. Catastrophic overfitting and quantization degradation both emerge from the same noise term in the capacity equation. The paper shows that a 4-bit quantized 70B model and a 13B model overtrained for 3× the optimal token count hit the same information bottleneck, just through different noise sources.
- 03. Resource allocation math. The framework lets teams calculate the marginal information gain from an extra trillion tokens versus an extra 10 billion parameters. When the noise floor rises faster than the signal, throwing more compute at the problem yields negative returns, a prediction classical scaling laws miss entirely.
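
The stopping-rule and marginal-gain ideas in the list above can be illustrated with a toy model. This is a sketch under loudly stated assumptions, not the paper's fitted law: the quadratic noise term, the constants `base_noise` and `noise_growth`, and the function names are all hypothetical, chosen only so that noise eventually outgrows signal. Units are arbitrary.

```python
import math


def toy_capacity(params: float, tokens: float,
                 base_noise: float = 1.0, noise_growth: float = 1e-4) -> float:
    """Toy Shannon-Hartley-style capacity with token-dependent noise.

    ASSUMPTION (not from the paper): noise = base_noise + noise_growth * tokens**2,
    so the signal-to-noise ratio tokens / noise peaks at
    tokens = sqrt(base_noise / noise_growth) and declines afterwards.
    """
    noise = base_noise + noise_growth * tokens ** 2
    return params * math.log2(1.0 + tokens / noise)


def marginal_gain(params: float, tokens: float,
                  d_params: float = 0.0, d_tokens: float = 0.0) -> float:
    """Change in toy capacity from adding d_params parameters and d_tokens tokens."""
    return toy_capacity(params + d_params, tokens + d_tokens) - toy_capacity(params, tokens)


if __name__ == "__main__":
    # With the default constants, the toy optimum sits at sqrt(1.0 / 1e-4) = 100 tokens.
    for tokens in (50.0, 100.0, 200.0):
        print(tokens, toy_capacity(10.0, tokens))
    print("extra tokens past the optimum:", marginal_gain(10.0, 100.0, d_tokens=50.0))
    print("extra parameters instead:     ", marginal_gain(10.0, 100.0, d_params=1.0))
```

In this toy setting, capacity rises up to the optimum and falls beyond it, so adding tokens past that point gives a negative marginal gain while adding parameters still helps. The real trade-off depends on the noise terms the paper fits, which are not reproduced here.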

