DiffusionOPD trains multi-task image models without cross-task interference
New framework trains task-specific teacher models independently, then distills them into a single student along the student's own rollout trajectories, avoiding cross-task interference.

DiffusionOPD, a multi-task training framework for diffusion models, addresses a core limitation of reinforcement learning for text-to-image generation: existing RL approaches either optimize one task at a time or suffer cross-task interference when trained on several tasks jointly. The researchers propose decoupling single-task exploration from multi-task integration: first train a separate teacher model for each objective, then distill the teachers' capabilities into a unified student model along the student's own sampling trajectories. This sidesteps the optimization burden of solving all tasks from scratch in one model while preventing the catastrophic forgetting that plagues cascade RL approaches.
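The summary doesn't spell out the training recipe, but the two-stage structure suggests a sketch like the following. This is a minimal PyTorch illustration, not the paper's implementation: `Denoiser`, the `SIGMAS` schedule, the toy rewards, the placeholder single-step reward update in stage 1, and the round-robin teacher schedule in stage 2 are all assumptions made for the example.

```python
import torch
import torch.nn as nn

DIM, STEPS = 16, 8
SIGMAS = torch.linspace(0.8, 0.1, STEPS)  # assumed shared noise schedule


class Denoiser(nn.Module):
    """Toy stand-in for a diffusion model: predicts the mean of the
    reverse transition p(x_{t-1} | x_t) at timestep t."""

    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(DIM + 1, 64), nn.SiLU(), nn.Linear(64, DIM)
        )

    def forward(self, x, t):
        t_col = torch.full((x.shape[0], 1), float(t))
        return self.net(torch.cat([x, t_col], dim=-1))


def rollout(model, batch=32):
    """Sample a full trajectory with the model's own SDE sampler."""
    xs = [torch.randn(batch, DIM)]
    for k in range(STEPS):
        mu = model(xs[-1], 1.0 - k / STEPS)
        xs.append((mu + SIGMAS[k] * torch.randn_like(mu)).detach())
    return xs


# Stage 1: train one teacher per task, independently. Each "task" here
# is a toy reward; the paper would run single-task RL instead.
rewards = [lambda x: -x.pow(2).mean(), lambda x: -(x - 1).pow(2).mean()]
teachers = []
for reward in rewards:
    teacher = Denoiser()
    opt = torch.optim.Adam(teacher.parameters(), lr=1e-3)
    for _ in range(100):
        xs = rollout(teacher)
        # Placeholder update: redo the final denoising step with
        # gradients enabled and push its mean toward higher reward.
        loss = -reward(teacher(xs[-2], 1.0 / STEPS))
        opt.zero_grad()
        loss.backward()
        opt.step()
    teachers.append(teacher.eval())

# Stage 2: distill every teacher into one student *on-policy*, i.e.
# along trajectories sampled from the student itself.
student = Denoiser()
opt = torch.optim.Adam(student.parameters(), lr=1e-3)
for step in range(200):
    teacher = teachers[step % len(teachers)]  # round-robin over tasks
    x = torch.randn(32, DIM)
    loss = 0.0
    for k in range(STEPS):
        t = 1.0 - k / STEPS
        mu_s = student(x, t)
        with torch.no_grad():
            mu_t = teacher(x, t)
        # Per-step Gaussian KL with shared variance reduces to mean
        # matching (see the closed-form objective below).
        loss = loss + (mu_s - mu_t).pow(2).sum(-1).mean() / (2 * SIGMAS[k] ** 2)
        x = (mu_s + SIGMAS[k] * torch.randn_like(mu_s)).detach()
    opt.zero_grad()
    loss.backward()
    opt.step()
```

Note the detach after each sampling step: the distillation loss is accumulated per step along the student's own trajectory, so supervision always lands on states the student actually visits, which is the point of the on-policy design.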
The framework extends Online Policy Distillation from discrete token spaces to continuous-state Markov processes, deriving a closed-form per-step KL objective that applies to both stochastic SDE and deterministic ODE sampling paths via mean matching. The authors report that this analytic gradient yields lower variance and better generalization than the PPO-style policy gradients used in prior diffusion RL work. Across all evaluated benchmarks, DiffusionOPD is reported to outperform multi-reward RL and cascade RL baselines in both training efficiency and final performance, achieving state-of-the-art results.
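To make the mean-matching claim concrete, here is the standard closed-form KL between two Gaussian transitions with a shared variance schedule; the shared-variance assumption and the symbols (student mean \(\mu_\theta\), teacher mean \(\mu_T\), step variance \(\sigma_t^2\)) are illustrative choices, not notation taken from the paper:

```latex
% Assumed Gaussian reverse transitions with a shared variance schedule:
%   student:  p_\theta(x_{t-1} \mid x_t) = \mathcal{N}(\mu_\theta(x_t, t), \sigma_t^2 I)
%   teacher:  p_T(x_{t-1} \mid x_t)      = \mathcal{N}(\mu_T(x_t, t), \sigma_t^2 I)
% Under equal variances, the per-step KL collapses to mean matching:
\[
  D_{\mathrm{KL}}\!\left( p_\theta(x_{t-1} \mid x_t) \,\middle\|\, p_T(x_{t-1} \mid x_t) \right)
  = \frac{\left\lVert \mu_\theta(x_t, t) - \mu_T(x_t, t) \right\rVert^2}{2\sigma_t^2}
\]
```

Because this term is a differentiable function of the student's predicted mean, its gradient can be computed exactly by backpropagation rather than estimated from sampled returns, which is a plausible source of the variance reduction the authors report relative to PPO-style policy gradients.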