vOPD stabilizes on-policy distillation with closed-form KL baseline
Researchers introduce a control variate baseline for on-policy distillation that reduces gradient variance without adding inference overhead or biasing the objective.

vOPD (On-Policy Distillation with a control variate baseline) is a stabilization technique for on-policy distillation (OPD), detailed in a preprint released this week. The paper targets a core training instability in OPD, namely the high gradient variance of single-sample Monte Carlo estimation, and addresses it with a control variate baseline drawn from the reinforcement learning literature.
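In generic policy-gradient notation (not the paper's own symbols), the identity behind this trick is that subtracting any baseline that does not depend on the sampled token leaves the expected gradient unchanged:

```latex
\nabla_\theta J(\theta)
  = \mathbb{E}_{y_t \sim \pi_\theta(\cdot \mid s_t)}
    \Big[ \big( r_t - b_t \big)\, \nabla_\theta \log \pi_\theta(y_t \mid s_t) \Big],
\qquad
\mathbb{E}_{y_t \sim \pi_\theta(\cdot \mid s_t)}
    \Big[ b_t\, \nabla_\theta \log \pi_\theta(y_t \mid s_t) \Big] = 0 .
```

A baseline close to the expected per-token reward shrinks the variance of the single-sample estimate; in vOPD that role is played by the closed-form KL quantity described next.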
The baseline takes the form of a per-token negative reverse KL divergence between the student and teacher models, computed directly from the forward pass with no additional critic network or inference calls. Existing stabilization methods either compute the full token-level reverse KL over the entire vocabulary, which adds significant computational overhead, or restrict the calculation to a top-k support, which biases the training objective. vOPD instead keeps the lightweight single-sample estimator of vanilla OPD and subtracts the value function as a detached baseline, which leaves the gradient unbiased while reducing its variance.
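To make the mechanics concrete, here is a minimal PyTorch-style sketch of a per-token loss built this way. It illustrates the idea described above rather than reproducing the authors' code: the function name, tensor shapes, and the exact REINFORCE-style surrogate are assumptions, and the paper's estimator may differ in detail.

```python
import torch
import torch.nn.functional as F

def vopd_style_loss(student_logits: torch.Tensor,
                    teacher_logits: torch.Tensor,
                    sampled_ids: torch.Tensor) -> torch.Tensor:
    """Sketch of on-policy distillation with a detached closed-form KL baseline.

    student_logits, teacher_logits: [batch, seq, vocab], from one forward pass.
    sampled_ids: [batch, seq], tokens sampled on-policy from the student.
    """
    s_logp = F.log_softmax(student_logits, dim=-1)
    t_logp = F.log_softmax(teacher_logits, dim=-1)

    # Log-probabilities of the sampled tokens under student and teacher.
    s_lp_tok = s_logp.gather(-1, sampled_ids.unsqueeze(-1)).squeeze(-1)
    t_lp_tok = t_logp.gather(-1, sampled_ids.unsqueeze(-1)).squeeze(-1)

    # Single-sample Monte Carlo "reward": the negative per-token log-ratio.
    # This is the high-variance signal vanilla OPD trains on.
    reward = -(s_lp_tok - t_lp_tok).detach()

    # Closed-form baseline: the negative per-token reverse KL over the full
    # vocabulary, computed from logits the forward pass already produced.
    # Detached, so it only re-centers the signal and does not bias the gradient.
    baseline = -(s_logp.exp() * (s_logp - t_logp)).sum(dim=-1).detach()

    advantage = reward - baseline

    # REINFORCE-style surrogate: maximize advantage-weighted log-probability
    # of the student's own samples (negated here to form a loss).
    return -(advantage * s_lp_tok).mean()
```

Because the baseline is detached, it never enters the backward pass; the only extra work relative to vanilla OPD in this sketch is an elementwise sum over the vocabulary that reuses logits already in memory.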
On benchmarks
Across mathematical and scientific reasoning benchmarks, vOPD consistently outperforms vanilla OPD and matches the most expensive full-vocabulary method. The authors also show that a top-k approximation of the baseline further lowers cost without compromising performance. The key efficiency gain over prior methods is that the value function admits a closed form computable from the already-completed forward pass.
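A sketch of how such a top-k approximation of the baseline might look, under the same assumptions as the snippet above (the cutoff k and the choice to truncate on the student's most probable tokens are illustrative, not taken from the paper):

```python
import torch
import torch.nn.functional as F

def topk_kl_baseline(student_logits: torch.Tensor,
                     teacher_logits: torch.Tensor,
                     k: int = 32) -> torch.Tensor:
    """Truncated version of the closed-form baseline.

    Restricting the sum to the student's top-k tokens approximates the
    per-token negative reverse KL. Because the baseline is subtracted as a
    detached control variate, truncating it affects variance, not bias.
    """
    s_logp = F.log_softmax(student_logits, dim=-1)
    t_logp = F.log_softmax(teacher_logits, dim=-1)
    top_s_logp, top_idx = s_logp.topk(k, dim=-1)   # student's k likeliest tokens
    top_t_logp = t_logp.gather(-1, top_idx)        # teacher log-probs at those tokens
    return -(top_s_logp.exp() * (top_s_logp - top_t_logp)).sum(dim=-1).detach()
```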
Authors Minjae Oh, Sangjun Song, Gyubin Choi, Yunho Choi, and Yohan Jo frame OPD as policy-gradient RL, making the control variate approach a natural fit for a post-training paradigm that has gained traction in reasoning domains but remains unstable in practice.