HDS scheduler cuts LLM pre-training iterations 44% with multi-objective RL
Holistic Data Scheduler uses reinforcement learning to dynamically adjust training data mixtures across quality, loss, and weight-norm dimensions, matching the best static baseline's final perplexity in 44 percent fewer steps on The Pile benchmark.

Holistic Data Scheduler (HDS) is a reinforcement learning framework from researchers Chenhao Dang, Jing Ma, and Mingjie Liao that reduces the number of iterations needed for large language model pre-training by dynamically rebalancing data sources during training. The preprint frames data scheduling as a continuous-control RL problem and solves it with the Soft Actor-Critic algorithm, which adjusts the mix of training domains (Wikipedia, GitHub, academic papers, web text) on the fly rather than fixing proportions at the start. On The Pile benchmark, HDS reached the final validation perplexity of the next-best static mixing method in 44 percent fewer training iterations, and it delivered a 7.2 percent gain on the MMLU zero-shot task alongside consistent improvements on other downstream evaluations.
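To make the continuous-control framing concrete, here is a minimal sketch of how a real-valued action vector could be turned into a data mixture and used to pick the source domain for each example in a batch. The softmax mapping, the `DOMAINS` list, and the function names are illustrative assumptions, not details taken from the preprint; in an actual setup the action would come from the Soft Actor-Critic policy rather than being hard-coded.

```python
import math
import random

# Hypothetical domain list, taken from the examples named in the article.
DOMAINS = ["wikipedia", "github", "academic_papers", "web_text"]


def action_to_mixture(action):
    """Map an unbounded real-valued action to sampling probabilities.

    Assumption: a softmax is used so the weights are positive and sum to 1.
    The preprint may parametrize the action differently.
    """
    peak = max(action)  # subtract the max for numerical stability
    exps = [math.exp(a - peak) for a in action]
    total = sum(exps)
    return [e / total for e in exps]


def sample_batch_domains(mixture, batch_size, rng):
    """Draw one source domain per batch element according to the mixture."""
    return rng.choices(DOMAINS, weights=mixture, k=batch_size)


rng = random.Random(0)

# A static scheduler would fix this vector once; a dynamic one would
# replace it with a fresh policy output every few training steps.
uniform_mixture = action_to_mixture([0.0, 0.0, 0.0, 0.0])
code_heavy_mixture = action_to_mixture([0.0, 1.5, 0.0, 0.0])

batch = sample_batch_domains(code_heavy_mixture, 8, rng)
```

The only difference between a static and a dynamic schedule in this sketch is how often the action vector changes, which is the behavior the article attributes to HDS.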
The core innovation is a three-part reward function that simultaneously optimizes data quality (per-domain perplexity on held-out validation sets), inter-domain influence (how adjusting one domain's weight affects loss on the others), and model health (L2 norms of weight updates, which flag overfitting or instability). Existing online data mixing techniques typically optimize a single metric, usually validation loss, and ignore both the feedback loops between domains and the risk of saturating on high-quality but narrow sources. HDS treats the three dimensions as co-equal objectives, letting the RL agent explore trade-offs that static heuristics miss. The authors ran ablations on models from 125M to 1.3B parameters and found that the multi-objective reward consistently outperformed single-objective baselines, with the largest gains appearing in the 350M–1.3B range, where domain interference is most pronounced.
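One way to picture the three-part reward described above is as a weighted sum of a quality term, an influence term, and a health term. The sketch below is an assumption-heavy illustration: the article does not give the preprint's exact formulas, so the log-perplexity quality term, the mean loss-change influence term, the norm-budget penalty, and all names (`RewardSignals`, `combined_reward`, `norm_budget`, the coefficient tuple) are hypothetical.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class RewardSignals:
    # Held-out validation perplexity, one value per domain.
    domain_val_perplexity: List[float]
    # Change in loss on the other domains after a weight adjustment
    # (negative means their loss went down).
    cross_domain_loss_delta: List[float]
    # L2 norm of the most recent weight update.
    update_l2_norm: float


def quality_term(perplexities):
    """Higher reward for lower average log-perplexity (assumed form)."""
    return -sum(math.log(p) for p in perplexities) / len(perplexities)


def influence_term(loss_deltas):
    """Higher reward when adjusting one domain lowers loss on the others."""
    return -sum(loss_deltas) / len(loss_deltas)


def health_term(update_norm, norm_budget):
    """Penalize only update norms above a hypothetical budget."""
    return -max(0.0, update_norm - norm_budget)


def combined_reward(signals, coeffs=(1.0, 1.0, 1.0), norm_budget=1.0):
    """Weighted sum of the three terms; coefficients are placeholders."""
    c_quality, c_influence, c_health = coeffs
    return (
        c_quality * quality_term(signals.domain_val_perplexity)
        + c_influence * influence_term(signals.cross_domain_loss_delta)
        + c_health * health_term(signals.update_l2_norm, norm_budget)
    )


baseline = RewardSignals([10.0, 20.0], [-0.1, 0.05], 0.5)
worse_quality = RewardSignals([12.0, 25.0], [-0.1, 0.05], 0.5)
unstable = RewardSignals([10.0, 20.0], [-0.1, 0.05], 3.0)
```

Setting two of the three coefficients to zero recovers a single-objective reward of the kind the ablations compare against, which is why a weighted sum is a convenient form for this illustration.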
The paper does not release trained weights or a public HDS implementation, so practitioners cannot yet drop the scheduler into their own pre-training runs. The experiments used The Pile's standard splits and a fixed compute budget per model size, which leaves open questions about how HDS behaves on proprietary or domain-specific corpora, whether the 44 percent iteration savings hold at tens of billions of parameters, and how sensitive the results are to the choice of the three reward coefficients. If the authors or a third party open-source the scheduler code and demonstrate similar gains on a frontier-scale model, HDS could become a standard component of pre-training pipelines, particularly for teams that cannot afford to waste tokens on suboptimal mixing schedules.



