On-policy distillation's efficiency stems from early trajectory alignment, enabling 3× faster training, new research shows
New research suggests on-policy distillation settles on its final update direction early in training, enabling a plug-and-play acceleration method that cuts training time roughly threefold while matching final performance.

On-policy distillation (OPD) achieves its efficiency not through denser supervision alone but by learning to "foresee" the final model state early in training, a finding that opens the door to 3× faster post-training runs without sacrificing performance.
A preprint posted to Hugging Face on May 18 by Yuchen Cai, Ding Cao, Liang Lin, Chunxi Luo, Xin Xu, and Kai Yang argues that OPD's advantage over traditional fine-tuning lies in two parameter-level behaviors. At the module-allocation level, the method identifies low-utility regions and concentrates updates on reasoning-critical modules. At the update-direction level, the dominant subspaces of OPD's updates align with the final update trajectory much earlier than they do under competing methods. The authors trace both phenomena through training runs on large language models and show that OPD's update path stabilizes long before convergence.
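To make "aligning early" concrete, one rough way to quantify it is to compare the update accumulated so far with the update the finished run ends up making. The sketch below is not the authors' analysis: it swaps the paper's dominant-subspace comparison for a plain cosine similarity over flattened parameters, and the function name `update_alignment` and its arguments are hypothetical.

```python
import math


def update_alignment(theta_init, theta_mid, theta_final):
    """Cosine similarity between the partial update (init -> mid checkpoint)
    and the final update (init -> final checkpoint).

    Assumption: parameters are given as flat lists of floats. This is a crude
    proxy for the subspace alignment described in the article, not the
    preprint's own metric.
    """
    partial = [m - i for m, i in zip(theta_mid, theta_init)]
    final = [f - i for f, i in zip(theta_final, theta_init)]
    dot = sum(p * f for p, f in zip(partial, final))
    norm = math.sqrt(sum(p * p for p in partial)) * math.sqrt(sum(f * f for f in final))
    # A zero-length update has no direction; report zero alignment.
    return dot / norm if norm > 0 else 0.0
```

A value near 1 at an early checkpoint would indicate that the run is already heading where it will finish; a value near 0 would indicate that the early updates say little about the final direction.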
Building on those observations, the team proposes EffOPD, a plug-and-play acceleration technique that adaptively selects an extrapolation step size and moves along the current update direction. The method requires no additional trainable modules or hyperparameter tuning. Across benchmarks, EffOPD delivered an average 3× training speedup while matching the final performance of standard OPD runs.
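The article does not spell out how EffOPD picks its step size, so the following is only an illustrative sketch of the general idea of extrapolating along the current update direction. The candidate step sizes, the `loss_fn` validation check, and the function name `extrapolate` are all assumptions made for this example and are not taken from the preprint.

```python
def extrapolate(theta_prev, theta_curr, loss_fn, candidates=(4.0, 2.0, 1.0, 0.5)):
    """Try jumping ahead along the most recent update direction.

    Assumptions (hypothetical, not from the preprint): parameters are flat
    lists of floats, `loss_fn` scores a parameter list (lower is better), and
    the step size is chosen from a small fixed set of candidates.
    Returns the chosen parameters and the step size used (0.0 means no jump).
    """
    direction = [c - p for c, p in zip(theta_curr, theta_prev)]
    best_theta, best_loss, best_alpha = list(theta_curr), loss_fn(theta_curr), 0.0
    for alpha in candidates:
        trial = [c + alpha * d for c, d in zip(theta_curr, direction)]
        trial_loss = loss_fn(trial)
        # Accept a jump only if it strictly improves on the best loss so far.
        if trial_loss < best_loss:
            best_theta, best_loss, best_alpha = trial, trial_loss, alpha
    return best_theta, best_alpha
```

If the update direction has already stabilized, as the preprint reports for OPD, a large jump of this kind can land close to where many ordinary steps would have ended up. When no candidate improves the loss, the sketch leaves the parameters unchanged.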
The findings suggest that OPD's efficiency is less about the density of the training signal and more about the geometry of the optimization path. That parameter-dynamics perspective could inform the design of future post-training methods for large language models.