CLPD framework schedules teacher strength and data difficulty to fix distillation failures
A new arXiv preprint introduces Curriculum Learning-Guided Progressive Distillation, a method that pairs easy-to-hard data ordering with progressively stronger teacher models to improve knowledge transfer into smaller language models.
Curriculum Learning-Guided Progressive Distillation (CLPD) is a knowledge distillation framework that addresses a longstanding puzzle in model compression: why stronger teacher models sometimes fail to produce better student models. The preprint, posted to arXiv this week, argues that existing distillation methods ignore two critical factors—the order in which training data is presented and the capacity gap between teacher and student—and proposes a unified solution that schedules both.
The framework constructs an explicit curriculum by organizing training examples from easy to hard, while simultaneously applying an implicit curriculum over supervision signals by progressively scheduling teachers of increasing capacity. This dual approach ensures that the student model encounters appropriately challenging data at each stage of training, matched to a teacher that is neither too weak to be informative nor so strong that the capacity gap overwhelms the student. CLPD is modular and can be integrated into standard distillation algorithms with minimal overhead, making it practical for existing pipelines.
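To make the scheduling concrete, here is a minimal PyTorch-style sketch of such a dual curriculum. The preprint's exact loss, difficulty metric, and staging rules are not detailed here, so everything below is an illustrative assumption rather than the authors' implementation: the `difficulty` scoring function, the even split of data stages across teachers, the weakest-first teacher ordering, and the temperature-scaled KL-plus-cross-entropy distillation loss.

```python
import torch
import torch.nn.functional as F

def clpd_train(student, teachers, dataset, difficulty,
               optimizer, epochs_per_stage=1, T=2.0):
    """Sketch of a CLPD-style loop: easy-to-hard data paired with
    progressively stronger teachers.

    Assumptions (not from the paper):
      - `teachers` is sorted by capacity, weakest first
      - `dataset` is a list of (x, y) tensor pairs
      - `difficulty` maps an (x, y) pair to a scalar score
    """
    # Explicit curriculum: order training examples from easy to hard.
    ordered = sorted(dataset, key=difficulty)

    # Implicit curriculum: assign one contiguous slice of the ordered
    # data to each teacher, so harder data meets stronger supervision.
    stage_size = len(ordered) // len(teachers)

    for stage, teacher in enumerate(teachers):
        stage_data = ordered[stage * stage_size:(stage + 1) * stage_size]
        teacher.eval()
        for _ in range(epochs_per_stage):
            for x, y in stage_data:
                with torch.no_grad():
                    t_logits = teacher(x)
                s_logits = student(x)

                # Soft-target loss against the current teacher,
                # temperature-scaled as in standard distillation.
                kd = F.kl_div(
                    F.log_softmax(s_logits / T, dim=-1),
                    F.softmax(t_logits / T, dim=-1),
                    reduction="batchmean",
                ) * T * T
                # Hard-label loss keeps the student anchored to ground truth.
                ce = F.cross_entropy(s_logits, y)

                loss = kd + ce
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
```

One design choice worth noting: this sketch advances the data stage and the teacher stage in lockstep, which is the simplest coupling. A modular framework like CLPD could equally decouple the two schedules, and nothing here depends on the specific distillation loss, which is what makes the approach easy to drop into existing pipelines.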
Empirical results on reasoning benchmarks show that CLPD consistently outperforms standard distillation, data ordering alone, and teacher scheduling alone across multiple settings. The gains are particularly notable when distilling reasoning abilities into small language models, where the capacity mismatch between large foundation models and compact targets is most severe. Ablation studies confirm that both the data curriculum and the teacher progression contribute independently to performance, but their combination yields the largest improvements.
The open questions now are whether CLPD scales to multimodal distillation and whether its curriculum construction generalizes beyond reasoning tasks to domains like code generation or instruction following, where data difficulty is harder to quantify up front.
