MixSD fine-tuning retains up to 100% of base model capability while learning new facts
A new self-distillation method lets language models absorb new facts without catastrophic forgetting by mixing tokens from expert and naive conditionals of the base model itself.

MixSD, a fine-tuning technique from researchers across multiple institutions, addresses catastrophic forgetting during supervised fine-tuning by aligning new training targets with the model's native generation distribution. The method retains up to 100 percent of a base model's held-out capability while maintaining near-perfect training accuracy on new facts, whereas standard supervised fine-tuning retains as little as 1 percent.
Instead of training on fixed targets, MixSD constructs supervision dynamically by mixing tokens from two conditionals of the base model itself: an expert conditional that observes the injected fact in context, and a naive conditional that reflects the model's original prior. The approach requires no external teacher model. The authors tested MixSD on synthetic corpora for factual recall and arithmetic function acquisition, as well as on open-domain question answering and knowledge editing benchmarks. Across multiple model scales, MixSD consistently outperformed standard supervised fine-tuning and on-policy self-distillation baselines on the memorization-retention trade-off. It also produced supervision targets with substantially lower negative log-likelihood and reduced harmful movement along Fisher-sensitive parameter directions.
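To make the token-mixing idea concrete, here is a minimal toy sketch in Python. It is not the authors' implementation: the article does not say how MixSD chooses between the two conditionals at each step, so the per-token coin flip and the `expert_prob` parameter are assumptions. The two "conditionals" are hypothetical stand-ins that emit fixed answers rather than real language-model calls, and the fact ("Zorvia") is invented for illustration.

```python
import random
from typing import Callable, List

# A "conditional" maps (prompt, tokens generated so far) to the next token.
NextToken = Callable[[str, List[str]], str]

EOS = "<eos>"


def make_greedy_conditional(answer: List[str]) -> NextToken:
    """Toy stand-in for a model: emits a fixed answer token by token, then EOS."""
    def next_token(prompt: str, prefix: List[str]) -> str:
        i = len(prefix)
        return answer[i] if i < len(answer) else EOS
    return next_token


def mix_target(
    expert: NextToken,
    naive: NextToken,
    prompt: str,
    expert_prob: float,
    rng: random.Random,
    max_len: int = 16,
) -> List[str]:
    """Build one supervision target by picking each token from either conditional.

    ASSUMPTION: an independent coin flip per token with probability
    `expert_prob` of using the expert; the real mixing rule may differ.
    """
    target: List[str] = []
    for _ in range(max_len):
        source = expert if rng.random() < expert_prob else naive
        token = source(prompt, target)  # both conditionals see the shared prefix
        target.append(token)
        if token == EOS:
            break
    return target


# Expert "sees" the injected fact in context; naive reflects the original prior.
expert = make_greedy_conditional(["The", "capital", "is", "Zorvia"])
naive = make_greedy_conditional(["The", "capital", "is", "unknown"])

rng = random.Random(0)
all_expert = mix_target(expert, naive, "What is the capital?", 1.0, rng)
all_naive = mix_target(expert, naive, "What is the capital?", 0.0, rng)
mixed = mix_target(expert, naive, "What is the capital?", 0.5, rng)
```

In this toy setup the two conditionals agree on every token except the one carrying the new fact, so a mixed target differs from the model's own prior output only at that position. That is one intuition for why such targets could stay close to the base model's native generation distribution.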
The preprint, authored by Jiarui Liu, Lechen Zhang, Yongjin Yang, Yinghui He, Yingheng Wang, and Weihao Xuan, was posted to HuggingFace Papers on May 19, 2026.