On-policy self-distillation cuts LLM safety tax by 8.85 points on small models
A new arXiv preprint shows that training language models on their own safety rollouts, rather than on external demonstrations, preserves reasoning ability while improving robustness to harmful queries, with the largest gains at sub-2B-parameter scales.
Safety alignment typically forces a tradeoff: models become more robust to harmful queries but lose reasoning ability in the process. A new preprint identifies off-policy training as a key culprit and proposes on-policy self-distillation for safety alignment (OPSA), a method that trains models on their own sampled trajectories rather than external demonstrations.
OPSA works by having the model generate its own rollouts and receive dense per-token KL supervision from a frozen teacher copy of itself conditioned on a privileged safety context. The authors introduce "teacher flip rate"—a metric that measures how often a privileged context converts unsafe responses into safe ones—to search for contexts that activate latent safety reasoning rather than merely elicit safe-looking outputs. Across two reasoning-model families and five model scales, OPSA achieves a stronger safety-reasoning tradeoff than off-policy self-distillation and external-teacher distillation under matched data and full-parameter fine-tuning.
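To make the dense per-token KL supervision concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the authors' implementation: the function names (`log_softmax`, `per_token_kl`, `distillation_loss`) are hypothetical, the logits are toy lists rather than real model outputs, and the KL direction (student relative to teacher) is an assumption, since the summary above does not specify it. The teacher logits are assumed to come from the frozen copy scoring the same student-sampled tokens with the privileged safety context prepended.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one position's vocabulary logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def per_token_kl(student_logits, teacher_logits):
    """KL(student || teacher) at every position of a student-sampled rollout.

    Both arguments are lists of per-position logit vectors of equal length.
    Assumption: the teacher vectors were produced by a frozen copy of the
    model that also saw a privileged safety context.
    """
    kls = []
    for s, t in zip(student_logits, teacher_logits):
        log_p_s = log_softmax(s)
        log_p_t = log_softmax(t)
        kls.append(sum(math.exp(ls) * (ls - lt) for ls, lt in zip(log_p_s, log_p_t)))
    return kls

def distillation_loss(student_logits, teacher_logits):
    """Mean per-token KL over the rollout: one supervision signal per token."""
    kls = per_token_kl(student_logits, teacher_logits)
    return sum(kls) / len(kls)

# Toy rollout: two positions, vocabulary of size two.
student = [[0.0, 0.0], [2.0, 0.0]]
teacher = [[0.0, 0.0], [0.0, 2.0]]
print(per_token_kl(student, teacher))
print(distillation_loss(student, teacher))
```

Because the loss is computed at every position of a trajectory the student itself sampled, the signal stays on-policy, in contrast to imitating demonstrations written by another model.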
What stands out
- 01. Largest gains on small models. R1-Distill-1.5B sees a +8.85 point improvement and Qwen3-0.6B gains +5.49 points compared to off-policy baselines. The effect persists but shrinks at larger scales.
- 02. Teacher flip rate as a search signal. Instead of assuming any safety prompt works, the authors measure how often a privileged context actually flips an unsafe rollout to a safe one, then optimize for contexts that do.
- 03. Token-level concentration. OPSA concentrates gradient updates near early compliance-decision tokens (the first few tokens where the model commits to refusing or complying), preserving general reasoning downstream.
- 04. Robustness to adaptive jailbreaks. The gains hold across training-set sizes and adaptive jailbreak evaluations, suggesting the method doesn't just memorize safe demonstrations.
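
The teacher flip rate described in the list above can be sketched as a simple ratio. This is a hedged illustration, not the paper's exact definition: it assumes the rate is computed only over prompts whose context-free response was unsafe, and that a safety judge (not shown, hypothetical) has already labeled each response as safe or unsafe. The names `teacher_flip_rate` and `best_context` are invented for this example.

```python
def teacher_flip_rate(base_safe, context_safe):
    """Share of initially unsafe responses that become safe with the context.

    base_safe[i]    -- True if the response to prompt i WITHOUT the context was safe
    context_safe[i] -- True if the response to prompt i WITH the context was safe
    Assumption: prompts that were already safe are excluded from the denominator.
    """
    unsafe = 0
    flips = 0
    for was_safe, now_safe in zip(base_safe, context_safe):
        if not was_safe:
            unsafe += 1
            if now_safe:
                flips += 1
    return flips / unsafe if unsafe else 0.0

def best_context(base_safe, results_by_context):
    """Pick the candidate context with the highest flip rate."""
    return max(
        results_by_context,
        key=lambda name: teacher_flip_rate(base_safe, results_by_context[name]),
    )

# Toy labels for four prompts and two hypothetical candidate contexts.
base = [False, False, True, False]
candidates = {
    "context_a": [True, False, True, True],
    "context_b": [False, False, True, True],
}
print(best_context(base, candidates))
```

Scoring contexts this way rewards those that change the model's behavior on prompts it previously failed, rather than those that only look good on prompts it already handled safely.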
