d-OPSD cuts diffusion LLM training steps to 10 percent of RLVR baseline
A new self-distillation method aligns training with the iterative denoising process in diffusion language models, beating RLVR on reasoning benchmarks while using a tenth of the optimization steps.

d-OPSD is a self-distillation technique that addresses a structural mismatch between classical on-policy self-distillation (OPSD) methods and diffusion language models (dLLMs). The approach trains dLLMs on their own generated outputs while shifting control from token-level predictions to training steps, synchronizing the learning process with the model's iterative denoising architecture.
Diffusion language models generate text through iterative refinement rather than token-by-token prediction. That architectural difference breaks traditional self-distillation methods, which rely on token-level feedback loops. d-OPSD implements a suffix condition, meaning the model is trained on its own generated answers, and moves the optimization control point from individual tokens to training steps that align with the denoising process.
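To make the contrast with token-by-token prediction concrete, here is a minimal toy sketch of iterative denoising in Python. It is not the d-OPSD method or any real dLLM API: `toy_denoiser`, `denoise`, and the confidence scores are hypothetical stand-ins, and the "model" simply reveals a known target. The point is only the control flow, in which every position starts masked and each step commits a batch of positions in parallel instead of appending one token at a time.

```python
import random

MASK = "<mask>"

def toy_denoiser(seq, target, rng):
    # Hypothetical stand-in for a dLLM forward pass: propose a token and a
    # confidence score for every masked position, all in parallel.
    return {i: (target[i], rng.random())
            for i, tok in enumerate(seq) if tok == MASK}

def denoise(target, num_steps, seed=0):
    rng = random.Random(seed)
    seq = [MASK] * len(target)          # start from a fully masked sequence
    history = [list(seq)]
    for step in range(num_steps):
        proposals = toy_denoiser(seq, target, rng)
        if not proposals:
            break
        steps_left = num_steps - step
        k = -(-len(proposals) // steps_left)   # ceil: positions to commit now
        # Commit the k most confident proposals; the rest stay masked.
        best = sorted(proposals, key=lambda i: proposals[i][1], reverse=True)[:k]
        for i in best:
            seq[i] = proposals[i][0]
        history.append(list(seq))
    return seq, history

if __name__ == "__main__":
    answer = "the answer is forty two .".split()
    final, history = denoise(answer, num_steps=3)
    for state in history:
        print(" ".join(state))
```

Because the unit of progress here is a whole denoising step rather than a single token, a training signal defined per token in left-to-right order has no natural place to attach, which is the mismatch the article describes.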
On reasoning benchmarks, d-OPSD outperforms both RLVR (reinforcement learning with verifiable rewards) and supervised fine-tuning while requiring only 10 percent of RLVR's optimization steps. For practitioners fine-tuning diffusion LLMs, a 90 percent reduction in optimization steps should translate to lower compute costs, provided the cost per step is comparable. That is a meaningful difference for teams running experiments on consumer hardware or limited cloud budgets. The code is available on GitHub.

