Safety-Aware Denoiser cuts unsafe text diffusion outputs without retraining
A new arXiv preprint introduces the Safety-Aware Denoiser (SAD), an inference-time safety framework that modifies the denoising loop of text diffusion models to steer outputs toward safe regions without retraining the base model.
Existing safety approaches were built for autoregressive models and rely on post-hoc filtering or token-level constraints. SAD instead targets the iterative denoising process that text diffusion models use to generate output, intervening during denoising itself to steer final samples toward provably safe regions of the text space. Because the intervention happens at inference time, it avoids expensive retraining of the underlying diffusion model while offering flexible, lightweight safety guidance.
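The summary above does not specify how SAD's intervention is computed. One common way to realize this kind of inference-time steering is classifier-style guidance, where the gradient of a safety scorer nudges each denoising step away from unsafe regions. The sketch below illustrates that general pattern only: `Denoiser`, `SafetyScorer`, `safety_guided_denoise`, and `guidance_scale` are hypothetical names, the toy update rule stands in for a real diffusion scheduler, and none of it is taken from the paper.

```python
import torch

# Illustrative stand-ins, NOT the paper's components: a text diffusion
# denoiser operating on continuous token embeddings, and a differentiable
# scorer whose output approximates "unsafe-ness" of a predicted sample.
class Denoiser(torch.nn.Module):
    def __init__(self, dim=64):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(dim + 1, 128), torch.nn.ReLU(), torch.nn.Linear(128, dim)
        )

    def forward(self, x_t, t):
        # Predict the clean embedding x0 from the noisy state x_t at step t.
        t_feat = torch.full((x_t.shape[0], 1), float(t))
        return self.net(torch.cat([x_t, t_feat], dim=-1))

class SafetyScorer(torch.nn.Module):
    def __init__(self, dim=64):
        super().__init__()
        self.head = torch.nn.Linear(dim, 1)

    def forward(self, x0_hat):
        # Higher score = more likely unsafe (toy proxy for a real classifier).
        return torch.sigmoid(self.head(x0_hat)).mean()

def safety_guided_denoise(denoiser, scorer, x_T, num_steps=50, guidance_scale=1.0):
    """One plausible reading of inference-time safety steering: at each
    denoising step, push the noisy state down the gradient of the unsafe
    score before continuing the loop. Neither network is updated."""
    x_t = x_T
    for step in reversed(range(num_steps)):
        x_t = x_t.detach().requires_grad_(True)
        x0_hat = denoiser(x_t, step)
        unsafe = scorer(x0_hat)
        grad = torch.autograd.grad(unsafe, x_t)[0]
        with torch.no_grad():
            # Steer the current state away from the unsafe region.
            x_t = x_t - guidance_scale * grad
            # Toy deterministic update toward the (steered) x0 estimate;
            # a real scheduler (e.g. DDIM-style) would replace this line.
            alpha = step / num_steps
            x_t = alpha * x_t + (1 - alpha) * denoiser(x_t, step)
    return x_t.detach()

if __name__ == "__main__":
    torch.manual_seed(0)
    d, s = Denoiser(), SafetyScorer()
    x_T = torch.randn(4, 64)          # batch of 4 noisy embedding vectors
    x_0 = safety_guided_denoise(d, s, x_T)
    print(x_0.shape)                  # torch.Size([4, 64])
```

The property this sketch shares with the paper's claim is the key one: both networks stay frozen, so safety behavior is changed purely at inference time by modifying the denoising loop.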
The researchers evaluated SAD across three safety dimensions: hazard taxonomy compliance, memorization of training data, and resistance to jailbreak prompts. Results show SAD substantially reduces unsafe generations while preserving output quality, diversity, and fluency, and it outperforms existing methods. The preprint, posted May 12, 2026, is available on arXiv as 2605.08116v1.
