NormGuard preserves image quality in flow-model RL fine-tuning by capping velocity inflation
A new training constraint preserves image quality during reinforcement learning post-training of flow-based generators by penalizing velocity norm inflation above a reference baseline.

NormGuard is a training-time penalty that addresses a structural defect in reinforcement learning post-training for flow-based image generators. Researchers identified a consistent signature of quality drift: across three RL fine-tuning methods (NFT, AWM, and DPO), the per-step velocity norm inflates by 5% to 15% relative to the reference model. That inflation is co-adapted with the model weights, so rescaling the velocity at inference time neither recovers perceptual quality nor improves reward. An adjoint sensitivity analysis confirms that velocity magnitude rescaling carries no coherent first-order reward signal at the batch level, so suppressing the inflation does not discard a reward-carrying component.
The authors propose a hinge penalty that activates only when the model's velocity norm exceeds the reference model's norm and that composes additively with any velocity-local base loss. Across two base models, three post-training methods, and two reward proxies, NormGuard consistently improves MLLM-judged image quality and forensic realism while preserving reward. The gains grow under few-step inference and are not explained by early stopping, according to the preprint posted June 29, 2026.
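
To make the idea concrete, here is a minimal sketch of a one-sided norm penalty added to a base loss. It is an illustration under stated assumptions, not the preprint's formulation. The function names, the linear (unsquared) hinge, the absolute rather than relative norm gap, and the `weight` and `margin` parameters are all hypothetical choices made for this example. Velocities are plain Python lists so the snippet runs without any tensor library.

```python
import math


def l2_norm(v):
    """Euclidean norm of a flat velocity vector (list of floats)."""
    return math.sqrt(sum(x * x for x in v))


def norm_inflation(v_model, v_ref):
    """Relative norm inflation of the fine-tuned velocity over the reference.

    A value of 0.10 means the model's velocity norm is 10% above the reference.
    """
    return l2_norm(v_model) / l2_norm(v_ref) - 1.0


def norm_hinge_penalty(v_model, v_ref, margin=0.0):
    """One-sided hinge: zero unless ||v_model|| exceeds ||v_ref|| + margin.

    Assumption: a linear hinge on the absolute norm gap. The actual penalty
    could be squared or normalized by the reference norm.
    """
    return max(0.0, l2_norm(v_model) - l2_norm(v_ref) - margin)


def guarded_loss(base_losses, v_models, v_refs, weight=1.0):
    """Batch mean of (base loss + weight * hinge penalty).

    The penalty is simply added to each per-sample base loss, so the base
    loss itself is left unchanged (additive composition).
    """
    total = sum(
        base + weight * norm_hinge_penalty(vm, vr)
        for base, vm, vr in zip(base_losses, v_models, v_refs)
    )
    return total / len(base_losses)
```

Because the hinge is zero whenever the model's norm is at or below the reference, samples that do not inflate contribute only their base loss. In a real training loop the same computation would be applied per sampling step to batched tensors, with the reference model's velocity computed without gradients.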




