CausalMix scales LLM data mixtures from 0.5B to 7B without retraining
New preprint frames LLM data mixing as a causal inference problem, fitting a model on 512 Qwen2.5-0.5B runs to extrapolate optimal mixtures for larger pools and 7B models.

CausalMix, a new framework for optimizing training-data proportions, treats data mixing as a causal inference problem rather than a static optimization task. Researchers fit a causal model on 512 training runs of Qwen2.5-0.5B to estimate the Conditional Average Treatment Effect (CATE) of different domain mixtures, then extrapolate the learned mixture to an 800,000-example data pool and use it to train a 7B model. The approach avoids the costly retraining that prior proxy-model methods require whenever the underlying data pool changes.
The framework treats statistical features of the data pool as covariates and the domain mixture as the treatment, which lets it separate out confounding bias and infer the optimal mixture for a given state of the pool. Experiments on Qwen3-4B-Base with long chain-of-thought data show consistent downstream-task improvements over RegMix and other baselines. The preprint, by Zinan Tang, Yukun Zhang, Shaomian Zheng, Zhuoshi Pan, Qizhi Pei, and Dingnan Jin, was posted July 2, 2026, and includes a CATE Interpreter for visual analysis of the learned strategy.
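To make the covariate/treatment framing concrete, here is a minimal sketch of conditional-effect estimation. It is not the preprint's method. It assumes a simple ridge-regularized linear outcome model with covariate-treatment interactions, fitted on synthetic data. All names (`features`, `fit_outcome_model`, `cate`, `best_mixture`), the two pool features, the three domains, and the outcome function are hypothetical and chosen only for illustration. The CATE of a candidate mixture is taken as the predicted outcome difference against a uniform baseline mixture, conditional on the pool features.

```python
import numpy as np

def features(x, t):
    """Design matrix: intercept, covariates, treatment, covariate-treatment interactions."""
    x, t = np.atleast_2d(x), np.atleast_2d(t)
    inter = (x[:, :, None] * t[:, None, :]).reshape(len(x), -1)
    return np.hstack([np.ones((len(x), 1)), x, t, inter])

def fit_outcome_model(X, T, y, ridge=1e-3):
    """Ridge regression of the outcome on (covariates, treatment); a stand-in model."""
    Phi = features(X, T)
    A = Phi.T @ Phi + ridge * np.eye(Phi.shape[1])
    return np.linalg.solve(A, Phi.T @ y)

def cate(w, x, t, t_base):
    """Predicted effect of mixture t versus baseline t_base, given pool features x."""
    return float((features(x, t) @ w - features(x, t_base) @ w)[0])

def best_mixture(w, x, candidates, t_base):
    """Index of the candidate mixture with the largest estimated effect at x."""
    effects = np.array([cate(w, x, t, t_base) for t in candidates])
    return int(np.argmax(effects)), effects

# Synthetic stand-in for 512 proxy runs (assumed data, not from the preprint).
rng = np.random.default_rng(0)
n = 512
X = rng.uniform(0, 1, size=(n, 2))        # hypothetical pool statistics
T = rng.dirichlet(np.ones(3), size=n)     # mixtures over 3 hypothetical domains
# Assumed ground truth: which domain helps depends on the first pool feature.
y = X[:, 0] * T[:, 0] + (1 - X[:, 0]) * T[:, 1] + rng.normal(0, 0.01, n)

w = fit_outcome_model(X, T, y)
t_base = np.full(3, 1 / 3)                # uniform baseline mixture
candidates = np.eye(3)                    # single-domain mixtures as candidates

for x_new in ([0.9, 0.5], [0.1, 0.5]):
    idx, effects = best_mixture(w, np.array(x_new), candidates, t_base)
    print(x_new, candidates[idx], np.round(effects, 3))
```

In this toy setup the preferred mixture changes with the pool features, which is the sense in which an optimal mixture can be state-dependent. A real pipeline would replace the linear model with whatever causal estimator the authors use and search over the full simplex rather than three candidates.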




