Adaptive teacher exposure lifts Qwen3 math reasoning by up to 2.33 points
New preprint shows that letting a lightweight controller adjust how much reference reasoning the teacher sees during training improves LLM self-distillation on math benchmarks.

Adaptive Teacher Exposure for Self-Distillation (ATESD) addresses a hidden bottleneck in on-policy self-distillation for large language models. Researchers Zihao Han, Tiangang Zhang, Huaibin Wang, and Yilun Sun argue that the standard practice of always showing the teacher model the full reference solution creates a mismatch: when the teacher conditions on reasoning steps far beyond what the student can currently absorb, the resulting token targets sit too far beyond the student's ability to provide a useful learning signal. A controlled sweep in the paper confirms that full exposure is not reliably optimal and that student-teacher mismatch grows monotonically as the teacher sees more privileged reasoning.
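As a rough illustration of partial exposure (not the authors' code; the helper name and step representation are hypothetical), a reveal ratio can be applied by truncating the reference reasoning before it is shown to the teacher:

```python
def truncate_reference(reference_steps, reveal_ratio):
    """Keep only the leading fraction of reference reasoning steps.

    Hypothetical helper illustrating partial teacher exposure:
    reveal_ratio = 1.0 reproduces the standard full-exposure setup,
    reveal_ratio = 0.0 shows the teacher no privileged reasoning.
    """
    n_keep = int(round(reveal_ratio * len(reference_steps)))
    return reference_steps[:n_keep]

steps = ["step 1", "step 2", "step 3", "step 4"]
print(truncate_reference(steps, 0.5))  # → ['step 1', 'step 2']
```

Under this view, the usual recipe fixes `reveal_ratio` at 1.0 for all of training; ATESD instead makes it a decision variable.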
ATESD treats teacher exposure as a learnable training-time variable rather than a fixed hyperparameter. A lightweight Beta-policy controller samples a reveal ratio conditioned on compact training-state statistics, then holds that exposure fixed for a short window of student updates. The controller is optimized with a discounted learning-progress reward that scores each decision by its effect on the student's future improvement rather than on the immediate loss change, addressing delayed credit assignment in on-policy distillation.

Evaluated on AIME 24, AIME 25, and HMMT 25 with Qwen3-1.7B, Qwen3-4B, and Qwen3-8B, ATESD consistently outperforms competitive self-distillation and reinforcement-learning baselines, beating the OPSD baseline by +0.95, +2.05, and +2.33 Average@12 points on the three model sizes respectively. All three sizes saw gains, with the largest lift on the 8B checkpoint.
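A minimal sketch of the controller idea, assuming a two-parameter Beta policy over the reveal ratio; the feature set, linear parameterization, and reward wiring below are illustrative guesses, not the authors' implementation:

```python
import math
import random

class BetaExposureController:
    """Toy Beta-policy controller: maps training-state statistics to
    Beta(alpha, beta) shape parameters and samples a reveal ratio in
    [0, 1]. The linear heads and softplus mapping are assumptions."""

    def __init__(self, n_features, hold_window=50):
        self.w_alpha = [0.0] * n_features  # hypothetical linear head
        self.w_beta = [0.0] * n_features   # hypothetical linear head
        self.hold_window = hold_window     # student updates per decision

    def sample_ratio(self, stats):
        # softplus keeps both Beta shape parameters strictly positive
        a = math.log1p(math.exp(sum(w * s for w, s in zip(self.w_alpha, stats))))
        b = math.log1p(math.exp(sum(w * s for w, s in zip(self.w_beta, stats))))
        return random.betavariate(a, b)

def learning_progress_reward(window_losses, gamma=0.9):
    """Discounted learning-progress reward: score a decision by the
    discounted sum of per-window loss improvements that follow it,
    rather than by the immediate loss change alone."""
    improvements = [prev - nxt for prev, nxt in zip(window_losses, window_losses[1:])]
    return sum(gamma ** t * d for t, d in enumerate(improvements))
```

In this sketch, each sampled ratio would be held for `hold_window` student updates, and the reward credits a decision whenever losses keep falling in the windows after it, matching the delayed-credit framing in the paper.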