Flow-DPPO replaces PPO ratio clipping with exact KL divergence for flow matching RL

Flow-DPPO, a new reinforcement learning method from Tencent Hunyuan researchers, computes exact KL divergence between policies in flow models instead of noisy single-sample ratio estimates, enabling stable multi-epoch training and better reward-KL efficiency.

By Alex Sokoloff · June 10, 2026

Flow-DPPO is a reinforcement learning training method from Tencent Hunyuan researchers that replaces PPO-style ratio clipping with exact divergence constraints for flow matching models used in image and video generation. The preprint, released June 10, argues that ratio clipping, used in recent methods like Flow-GRPO and CPS, is structurally ill-suited to flow models. The probability ratio between old and new policies is a noisy, single-sample estimate of the true policy divergence, so clipping on it over-constrains some trajectory regions and under-constrains others.

The key insight is that per-step policies in flow models are Gaussian, which allows exact and cheap computation of the KL divergence between old and new policies. Flow-DPPO uses an asymmetric divergence mask that blocks gradient updates only when they simultaneously move away from the trusted region and violate the divergence threshold. Experiments show the method achieves higher rewards with better KL-proximal efficiency, alleviates catastrophic forgetting, promotes balanced multi-objective optimization, and enables stable multi-epoch training where ratio clipping degrades.
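To make the two ideas above concrete, here is a minimal NumPy sketch, not the authors' implementation. It assumes the old and new per-step policies are isotropic Gaussians sharing one standard deviation `sigma`, in which case the KL divergence reduces to the closed form ||mu_old - mu_new||^2 / (2 sigma^2). It also assumes "moving away from the trusted region" means the update would push the probability ratio further from 1 in the direction the advantage favors. The function names, the `delta` threshold, and the toy values are hypothetical.

```python
import numpy as np

def gaussian_kl(mu_old, mu_new, sigma):
    """Closed-form KL(N(mu_old, sigma^2 I) || N(mu_new, sigma^2 I)) per sample.

    Assumption: both policies share the same isotropic std `sigma`,
    so only the means differ.
    """
    diff = (mu_old - mu_new).reshape(mu_old.shape[0], -1)
    return (diff ** 2).sum(axis=1) / (2.0 * sigma ** 2)

def asymmetric_mask(advantage, log_ratio, kl, delta):
    """Return 1.0 where the gradient is kept and 0.0 where it is blocked.

    Assumption: an update "moves away" when it would raise the probability
    of a sample whose ratio is already above 1 (advantage > 0) or lower the
    probability of a sample whose ratio is already below 1 (advantage < 0).
    A sample is blocked only if it moves away AND its KL exceeds `delta`.
    """
    moving_away = ((advantage > 0) & (log_ratio > 0)) | (
        (advantage < 0) & (log_ratio < 0)
    )
    blocked = moving_away & (kl > delta)
    return (~blocked).astype(np.float64)

# Toy batch of four denoising steps with 2-dimensional latents (made-up values).
mu_old = np.zeros((4, 2))
mu_new = np.array([[0.1, 0.0], [1.0, 0.0], [1.0, 0.0], [1.0, 0.0]])
advantage = np.array([1.0, 1.0, 1.0, -1.0])
log_ratio = np.array([0.2, 0.3, -0.3, 0.3])

kl = gaussian_kl(mu_old, mu_new, sigma=1.0)
mask = asymmetric_mask(advantage, log_ratio, kl, delta=0.1)
```

In this toy batch only the second sample is masked: it both exceeds the divergence threshold and would move further from the old policy. The third and fourth samples also exceed the threshold but their updates point back toward the old policy, so their gradients pass through. That asymmetry is what distinguishes this mask from a symmetric divergence cutoff.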

Code and models are available on GitHub at github.com/Tencent-Hunyuan/UniRL/tree/main/FlowDPPO. Authors include Bowen Ping, Xiangxin Zhou, Penghui Qi, Minnan Luo, Liefeng Bo, and Tianyu Pang.
