InfoLaw predicts LLM pretraining loss across quality mixtures and repetition with 0.15% error
A new scaling framework models pretraining as information accumulation, predicting loss across varied data quality and overtraining regimes up to 7B parameters and 425B tokens.
A team of researchers introduced InfoLaw, a data-aware scaling framework that predicts large language model loss from consumed tokens, model size, data mixture weights, and repetition levels. The preprint addresses a persistent gap in scaling laws: standard formulations fail to extrapolate reliably when data quality varies or when datasets are repeated during overtraining.
InfoLaw reframes pretraining as information accumulation, where quality controls information density and repetition induces scale-dependent diminishing returns. The authors trained models on datasets that varied in scale, quality distribution, and repetition level, then built an information-based model to predict performance. The framework achieved 0.15% mean absolute error and 0.96% maximum absolute error in loss prediction on unseen data recipes and larger runs, up to 7 billion parameters and 425 billion tokens.
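The preprint's exact functional form is not reproduced here, but a minimal sketch conveys the idea: take a Chinchilla-style loss and replace the raw-token term with an "effective information" quantity, in which each data source contributes tokens weighted by an assumed quality density and repeated epochs are discounted. Everything in the sketch (the exponential decay, the quality weights, and the constants, which are borrowed from published Chinchilla fits purely as placeholders) is an assumption for illustration, not InfoLaw's actual parameterization.

```python
import math

def effective_information(mixture, r_star=15.0):
    """Quality-weighted, repetition-discounted token count.

    mixture: list of (unique_tokens, quality_density, epochs) per data source.
    r_star:  assumed decay constant controlling diminishing returns from repeats.
    """
    total = 0.0
    for unique_tokens, quality, epochs in mixture:
        # Epoch k of a source contributes exp(-(k - 1) / r_star) of a fresh pass,
        # so repeated data adds progressively less new information.
        discounted_epochs = sum(math.exp(-(k - 1) / r_star)
                                for k in range(1, int(epochs) + 1))
        total += quality * unique_tokens * discounted_epochs
    return total

def predicted_loss(n_params, mixture,
                   E=1.69, A=406.4, B=410.7, alpha=0.34, beta=0.28):
    """Chinchilla-style loss with the data term driven by effective information."""
    info = effective_information(mixture)
    return E + A / n_params ** alpha + B / info ** beta

if __name__ == "__main__":
    # Two hypothetical recipes at the same 200B raw-token budget.
    high_quality_repeated = [(50e9, 1.3, 4)]   # 50B unique high-quality tokens, 4 epochs
    mixed_single_epoch = [(200e9, 0.9, 1)]     # 200B unique mixed-quality tokens, 1 epoch
    for name, mix in [("high-quality x4", high_quality_repeated),
                      ("mixed x1", mixed_single_epoch)]:
        print(f"{name:16s} predicted loss: {predicted_loss(7e9, mix):.4f}")
```

Under this assumed form, higher quality density raises the effective-information term the same way more unique tokens would, while extra epochs contribute with a shrinking discount. The preprint describes the diminishing returns from repetition as scale-dependent, which this fixed-decay sketch deliberately omits for brevity.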
What stands out
- Sub-1% extrapolation accuracy: InfoLaw predicted loss on held-out data mixtures and scale points with mean error of 0.15%, including runs that combined high-quality upweighting with multiple epochs of repetition.
- Overtraining regime modeling: In data-limited scenarios where repetition is unavoidable, the framework reliably models the interaction between mixture weights and diminishing returns from repeated data, a regime where prior scaling laws break down.
- Practical validation range: Testing extended to 7B-parameter models trained on 425B tokens, covering the scale and overtraining levels common in open-weight pretraining.
- Data-recipe selection under budget: The framework enables practitioners to compare candidate data mixtures under fixed compute budgets, predicting which combination of quality upweighting and repetition will yield the lowest loss at target scale (see the sketch after this list).
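As a rough illustration of that last point, the sketch below sweeps a few hypothetical recipes under a fixed 425B raw-token budget using the same illustrative loss form as above. The candidate labels, quality densities, and epoch counts are invented, and the resulting ranking reflects only the placeholder constants, not the paper's fitted law.

```python
import math

def predicted_loss(n_params, unique_tokens, quality, epochs,
                   E=1.69, A=406.4, B=410.7, alpha=0.34, beta=0.28, r_star=15.0):
    # Same illustrative form as above: quality-weighted tokens with an
    # exponential discount on repeated epochs feed a Chinchilla-style data term.
    discounted_epochs = sum(math.exp(-(k - 1) / r_star) for k in range(1, epochs + 1))
    info = quality * unique_tokens * discounted_epochs
    return E + A / n_params ** alpha + B / info ** beta

budget = 425e9  # fixed raw-token budget: unique_tokens * epochs
candidates = [
    # (label, assumed quality density, epochs); unique tokens follow from the budget
    ("broad web, 1 epoch",           0.9, 1),
    ("filtered web, 2 epochs",       1.1, 2),
    ("high-quality core, 4 epochs",  1.3, 4),
]

for label, quality, epochs in candidates:
    unique_tokens = budget / epochs
    loss = predicted_loss(7e9, unique_tokens, quality, epochs)
    print(f"{label:30s} predicted loss: {loss:.4f}")
```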
