SafeDiffusion-R1 cuts unsafe image generation to 18% using online reward steering
A new post-training method uses online reinforcement learning and CLIP embedding steering to reduce unsafe content in diffusion models without supervised data or catastrophic forgetting.
SafeDiffusion-R1 is a post-training framework that reduces unsafe content generation in diffusion models through online reinforcement learning. Researchers at Mohamed bin Zayed University of AI and the University of Warwick apply Group Relative Policy Optimization (GRPO) to both negative and positive text prompts while steering CLIP text embeddings toward safe directions in the embedding space, eliminating the need for expensive supervised image-text pairs or specialized reward model fine-tuning.
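As a rough illustration of the "group relative" part of GRPO, the sketch below scores several images sampled from the same prompt and normalizes each reward against its own group, so no separate value network is needed. This is the generic GRPO advantage calculation, not the authors' code: the function name `group_relative_advantages`, the reward values, and the `eps` constant are all hypothetical, and the CLIP-embedding steering step is not shown.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the mean and std of its own sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group
    # eps avoids division by zero when every sample gets the same reward
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical safety rewards for four images sampled from one prompt
# (higher = safer); these numbers are made up for illustration.
rewards = [0.9, 0.2, 0.6, 0.3]
advantages = group_relative_advantages(rewards)

for r, a in zip(rewards, advantages):
    print(f"reward={r:.2f}  advantage={a:+.3f}")
```

Samples that score above their group's average get a positive advantage and are reinforced; those below get a negative one. Because the comparison is always within a freshly sampled group, the signal comes from the current model's own outputs rather than from a fixed offline dataset.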
The approach addresses two longstanding problems in diffusion safety work: data scarcity and catastrophic forgetting. Existing methods require either unsafe text paired with safe-image ground truth or negative/positive image pairs, both of which are costly to collect at scale. Offline reinforcement learning and supervised fine-tuning approaches, which train on synthetic data generated ahead of time, tend to degrade generation quality as the model forgets its original capabilities. SafeDiffusion-R1's online training design lets the model learn from diverse prompts, including explicitly unsafe ones, without losing generation quality.
Tested on Stable Diffusion v1.4, the method reduced the rate of inappropriate content to 18.07% (down from a 48.9% baseline) and cut nudity detections to 15 (versus 646 for the unmodified model). Compositional generation quality on GenEval improved from 42.08% to 47.83%. The safety gains generalize to out-of-domain unsafe prompts across seven harm categories without requiring supervised paired data or reward tuning. Code is available on GitHub.
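To put the reported figures on a common scale, the short calculation below converts them into relative changes. It uses only the numbers quoted above; the helper name `relative_change` is hypothetical.

```python
def relative_change(before, after):
    """Relative change from `before` to `after`, as a percentage of `before`."""
    return 100.0 * (after - before) / before


inappropriate = relative_change(48.9, 18.07)   # inappropriate-content rate
nudity = relative_change(646, 15)              # nudity detections
geneval_points = 47.83 - 42.08                 # absolute gain, percentage points
geneval = relative_change(42.08, 47.83)        # relative GenEval gain

print(f"Inappropriate content: {inappropriate:.1f}%")  # about -63.0%
print(f"Nudity detections:     {nudity:.1f}%")         # about -97.7%
print(f"GenEval: +{geneval_points:.2f} points ({geneval:+.1f}% relative)")
```

In relative terms, that is roughly a 63% drop in inappropriate content and a 98% drop in nudity detections, alongside a 5.75-point gain on GenEval.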
