DPO fails to match RLHF under common conditions, researchers prove, proposing Constrained Preference Optimization as a fix
A preprint proves Direct Preference Optimization only matches RLHF when an implicit assumption holds—and when it fails, DPO can converge on policies that prefer dispreferred responses.
Direct Preference Optimization (DPO) has become a go-to alternative to Reinforcement Learning from Human Feedback (RLHF) for aligning language models, largely because practitioners believe the two methods are theoretically equivalent while DPO is simpler to implement. A new preprint from researchers at Imperial College London, Nanyang Technological University, and Hong Kong Baptist University proves that equivalence is conditional, not universal—and the condition is violated often enough in real training runs to produce pathological outcomes.
The paper, posted May 21, identifies an implicit assumption underlying DPO's claimed equivalence with RLHF: the RLHF-optimal policy must itself prefer the human-preferred response. When that assumption fails, which the authors show happens in practice, DPO optimizes relative advantage over a reference policy rather than absolute alignment with human preferences. The result is a failure mode in which the training loss decreases while the policy converges on responses that humans dispreferred. The authors characterize when the assumption is violated, prove that DPO and RLHF optimize fundamentally different objectives in those cases, and show that there is a space of undesirable solutions DPO can fall into.
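A small numeric sketch can make the failure mode concrete. The snippet below uses the standard published DPO loss for a single preference pair; it is not taken from the preprint. The function name `dpo_loss`, the value `beta=0.1`, and all log-probabilities are illustrative assumptions. It shows a policy whose loss falls below its starting value even though it still assigns more probability to the dispreferred response.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair of sequence log-probs."""
    policy_margin = policy_chosen_logp - policy_rejected_logp
    ref_margin = ref_chosen_logp - ref_rejected_logp
    # DPO only sees the policy margin relative to the reference margin.
    logits = beta * (policy_margin - ref_margin)
    # -log(sigmoid(x)) == log(1 + exp(-x))
    return math.log1p(math.exp(-logits))


# Illustrative numbers (assumptions): the reference model strongly
# favors the response that humans dispreferred.
ref_chosen, ref_rejected = -12.0, -4.0

# At initialization the policy equals the reference.
loss_init = dpo_loss(ref_chosen, ref_rejected, ref_chosen, ref_rejected)

# After some training the policy has narrowed the gap, but not closed it.
trained_chosen, trained_rejected = -9.0, -5.0
loss_trained = dpo_loss(trained_chosen, trained_rejected,
                        ref_chosen, ref_rejected)
trained_margin = trained_chosen - trained_rejected

print(f"loss at init:      {loss_init:.4f}")
print(f"loss after update: {loss_trained:.4f}")
print(f"policy margin (chosen - rejected): {trained_margin:.1f}")
```

In this toy case the loss drops because the policy improved relative to the reference, while the policy margin stays negative, so the dispreferred response remains the more likely one. Whether real training runs end up in this regime depends on the conditions the paper analyzes.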
On margin ranking and constraints
The paper provides a geometric interpretation through soft margin ranking, revealing that DPO implements margin ranking with potentially negative targets—a detail that explains why it can prefer the wrong responses when the implicit assumption breaks. To address the gap, the authors introduce Constrained Preference Optimization (CPO), which augments RLHF with explicit constraints to guarantee alignment. The method preserves DPO's implementation simplicity while offering provable guarantees.
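One way to see the margin-ranking reading is to rewrite the standard DPO loss for a single pair, with preferred response y_w and dispreferred response y_l. The notation below (the symbols Δ_θ and Δ_ref) is an assumption chosen for illustration and may differ from the paper's own formulation.

```latex
\begin{aligned}
\Delta_\theta &= \log \pi_\theta(y_w \mid x) - \log \pi_\theta(y_l \mid x), \\
\Delta_{\mathrm{ref}} &= \log \pi_{\mathrm{ref}}(y_w \mid x) - \log \pi_{\mathrm{ref}}(y_l \mid x), \\
\mathcal{L}_{\mathrm{DPO}} &= -\log \sigma\!\big(\beta\,(\Delta_\theta - \Delta_{\mathrm{ref}})\big)
 = \log\!\big(1 + \exp\!\big(\beta\,(\Delta_{\mathrm{ref}} - \Delta_\theta)\big)\big).
\end{aligned}
```

Read this way, the loss is a soft version of a margin ranking loss whose target margin is Δ_ref: it becomes small once Δ_θ exceeds Δ_ref. If the reference policy favors the dispreferred response, Δ_ref is negative, and the loss can be driven down while Δ_θ is still below zero.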
Experiments on standard benchmarks show CPO reaches state-of-the-art performance. Code is available at github.com/visitworld123/CPO. The theoretical analysis establishes exactly when DPO's guarantees hold and when practitioners need a different approach.
