Flow-DPPO replaces PPO ratio clipping with exact KL divergence for flow matching RL
Flow-DPPO, a new reinforcement learning method from Tencent Hunyuan researchers, computes exact KL divergence between policies in flow models instead of noisy single-sample ratio estimates, enabling stable multi-epoch training and better reward-KL efficiency.
Flow-DPPO is a reinforcement learning training method from Tencent Hunyuan researchers that replaces PPO-style ratio clipping with exact divergence constraints for flow matching models used in image and video generation. The preprint, released June 10, argues that ratio clipping, used in recent methods like Flow-GRPO and CPS, is structurally ill-suited to flow models. The probability ratio between old and new policies is a noisy, single-sample estimate of the true policy divergence, so clipping on it over-constrains some regions of the trajectory and under-constrains others.
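To make the "noisy, single-sample estimate" point concrete, here is a minimal sketch, not taken from the preprint. It assumes a toy one-dimensional Gaussian policy with a shared standard deviation. The means, the clip range of 0.2, and every function name are illustrative assumptions.

```python
import math
import random


def gaussian_logpdf(x, mean, std):
    """Log density of a 1-D Gaussian N(mean, std^2) at x."""
    return -0.5 * ((x - mean) / std) ** 2 - math.log(std) - 0.5 * math.log(2 * math.pi)


def exact_kl_same_std(mean_old, mean_new, std):
    """Closed-form KL(old || new) for two 1-D Gaussians sharing one std."""
    return (mean_old - mean_new) ** 2 / (2 * std ** 2)


def sample_log_ratios(mean_old, mean_new, std, n, seed=0):
    """Draw n actions from the old policy and return log(new(x) / old(x)) for each."""
    rng = random.Random(seed)
    log_ratios = []
    for _ in range(n):
        x = rng.gauss(mean_old, std)
        log_ratios.append(gaussian_logpdf(x, mean_new, std) - gaussian_logpdf(x, mean_old, std))
    return log_ratios


# Toy numbers (assumptions): the two policies differ only by a small mean shift.
mean_old, mean_new, std = 0.0, 0.3, 1.0
kl = exact_kl_same_std(mean_old, mean_new, std)  # about 0.045, identical for every sample

log_ratios = sample_log_ratios(mean_old, mean_new, std, n=20000)

# Averaging -log(ratio) over many samples recovers the KL ...
mc_kl = -sum(log_ratios) / len(log_ratios)

# ... but a PPO-style clip acts on each single-sample ratio, which varies widely.
clip_eps = 0.2  # illustrative clip range
frac_outside_clip = sum(abs(math.exp(lr) - 1.0) > clip_eps for lr in log_ratios) / len(log_ratios)

print(f"exact KL: {kl:.4f}, Monte Carlo KL: {mc_kl:.4f}")
print(f"fraction of samples outside the clip range: {frac_outside_clip:.2f}")
```

In this toy setting the divergence between the two policies is one fixed number, yet some sampled ratios fall outside the clip range and others stay inside it, so whether a given sample is constrained depends on the draw rather than on how far the policy has moved.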
The key insight is that per-step policies in flow models are Gaussian, so the KL divergence between old and new policies can be computed exactly and cheaply in closed form. Flow-DPPO uses an asymmetric divergence mask that blocks a gradient update only when it both moves the policy away from the trusted region and violates the divergence threshold. The authors report that the method achieves higher rewards with better KL-proximal efficiency, alleviates catastrophic forgetting, promotes balanced multi-objective optimization, and keeps multi-epoch training stable where ratio clipping degrades.
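The two ingredients described above can be sketched as follows. This is a hedged illustration, not the released implementation. It assumes isotropic Gaussian step policies, and it reads "moving away from the trusted region" as the update pushing the probability ratio further from 1 in the direction the advantage favors. The function names, the threshold value, and the use of a sign test on the advantage are all assumptions.

```python
import math


def gaussian_kl_isotropic(mean_old, std_old, mean_new, std_new):
    """Closed-form KL( N(mean_old, std_old^2 I) || N(mean_new, std_new^2 I) )."""
    d = len(mean_old)
    sq_dist = sum((a - b) ** 2 for a, b in zip(mean_old, mean_new))
    return (
        d * math.log(std_new / std_old)
        + (d * std_old ** 2 + sq_dist) / (2 * std_new ** 2)
        - d / 2
    )


def keep_gradient(kl, ratio, advantage, kl_threshold):
    """Asymmetric mask (assumed form): drop the update only when BOTH hold.

    1. The update would push the ratio further from 1
       (positive advantage with ratio > 1, or negative advantage with ratio < 1).
    2. The exact KL already exceeds the threshold.
    Updates that move back toward the old policy are always kept.
    """
    moving_away = (advantage > 0 and ratio > 1.0) or (advantage < 0 and ratio < 1.0)
    return not (moving_away and kl > kl_threshold)


# Toy step policy in 2-D (illustrative values only).
kl = gaussian_kl_isotropic([0.0, 0.0], 1.0, [0.3, 0.4], 1.0)
threshold = 0.05  # hypothetical divergence threshold

print(f"exact KL: {kl:.3f}")
print(keep_gradient(kl, ratio=1.3, advantage=+1.0, kl_threshold=threshold))   # blocked
print(keep_gradient(kl, ratio=0.8, advantage=+1.0, kl_threshold=threshold))   # kept: moves back
print(keep_gradient(0.01, ratio=1.3, advantage=+1.0, kl_threshold=threshold))  # kept: within threshold
```

Because the KL here comes from the policy parameters rather than from a sampled action, the same mean shift yields the same divergence value regardless of which action was drawn, which is the property the ratio-based clip lacks.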
Code and models are available on GitHub at github.com/Tencent-Hunyuan/UniRL/tree/main/FlowDPPO. Authors include Bowen Ping, Xiangxin Zhou, Penghui Qi, Minnan Luo, Liefeng Bo, and Tianyu Pang.







