
DPO fails to match RLHF under common conditions, researchers prove with Constrained Preference Optimization fix

A preprint proves Direct Preference Optimization only matches RLHF when an implicit assumption holds—and when it fails, DPO can converge on policies that prefer dispreferred responses.

By Alex Sokoloff · May 18, 2026


Direct Preference Optimization (DPO) has become a go-to alternative to Reinforcement Learning from Human Feedback (RLHF) for aligning language models, largely because practitioners believe the two methods are theoretically equivalent while DPO is simpler to implement. A new preprint from researchers at Imperial College London, Nanyang Technological University, and Hong Kong Baptist University proves that equivalence is conditional, not universal—and the condition is violated often enough in real training runs to produce pathological outcomes.

The paper, posted May 21, identifies an implicit assumption underlying DPO's claimed equivalence with RLHF: the RLHF-optimal policy must prefer human-preferred responses. When that assumption fails, which the authors show happens in practice, DPO optimizes relative advantage over a reference policy rather than absolute alignment with human preferences. The result is a failure mode in which the training loss decreases while the policy converges on the responses humans dispreferred. The authors characterize when the assumption is violated, prove that DPO and RLHF optimize fundamentally different objectives in those cases, and show that DPO admits a set of undesirable solutions it can converge to.
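The gap between relative advantage and absolute preference can be seen in a small numeric sketch. The snippet below uses the standard per-pair DPO loss with made-up log-probabilities (all numbers, the function name, and beta = 1.0 are illustrative assumptions, not values or code from the preprint). The reference policy strongly favors the dispreferred response; after a hypothetical update the loss drops sharply even though the policy still ranks the dispreferred response higher.

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=1.0):
    """Standard DPO loss for one pair, where y_w is the human-preferred response."""
    # Implicit reward margin: how much the policy improved on y_w relative to
    # the reference, minus how much it improved on y_l.
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log(sigmoid(margin)) written as log(1 + exp(-margin))
    return math.log1p(math.exp(-margin))

# Hypothetical log-probabilities: the reference strongly favors the
# dispreferred response y_l over the preferred response y_w.
ref_logp_w, ref_logp_l = -5.0, -1.0

# At initialization the policy equals the reference, so the margin is 0.
loss_start = dpo_loss(-5.0, -1.0, ref_logp_w, ref_logp_l)

# After a hypothetical update the policy has raised y_w and lowered y_l
# relative to the reference, but y_l is still the more likely response.
policy_logp_w, policy_logp_l = -3.0, -1.5
loss_later = dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l)

still_prefers_dispreferred = policy_logp_l > policy_logp_w

print(f"loss at start: {loss_start:.4f}")
print(f"loss later:    {loss_later:.4f}")
print(f"policy still prefers dispreferred response: {still_prefers_dispreferred}")
```

In this toy setting the loss falls from log 2 to well under 0.1 while the policy's ranking of the two responses remains wrong, which is the kind of outcome the article describes.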

On margin ranking and constraints

The paper provides a geometric interpretation through soft margin ranking, revealing that DPO implements margin ranking with potentially negative targets—a detail that explains why it can prefer the wrong responses when the implicit assumption breaks. To address the gap, the authors introduce Constrained Preference Optimization (CPO), which augments RLHF with explicit constraints to guarantee alignment. The method preserves DPO's implementation simplicity while offering provable guarantees.
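One way to see the negative-target effect is to rearrange the standard per-pair DPO loss. The notation below (the log-ratio gaps Δ_θ and Δ_ref) is introduced here for illustration and may differ from the preprint's own formulation.

```latex
% Standard per-pair DPO loss, with y_w the preferred and y_l the dispreferred response
\[
\mathcal{L}_{\mathrm{DPO}}(\theta)
  = -\log \sigma\!\Big(\beta\big[\Delta_\theta - \Delta_{\mathrm{ref}}\big]\Big)
  = \log\!\Big(1 + e^{-\beta(\Delta_\theta - \Delta_{\mathrm{ref}})}\Big),
\]
\[
\Delta_\theta = \log \pi_\theta(y_w \mid x) - \log \pi_\theta(y_l \mid x),
\qquad
\Delta_{\mathrm{ref}} = \log \pi_{\mathrm{ref}}(y_w \mid x) - \log \pi_{\mathrm{ref}}(y_l \mid x).
\]
% The loss is a smooth (logistic) ranking penalty on \Delta_\theta whose
% target margin is \Delta_{\mathrm{ref}}. If the reference policy prefers y_l,
% then \Delta_{\mathrm{ref}} < 0, and the loss can be made small with
% \Delta_{\mathrm{ref}} < \Delta_\theta < 0, i.e. while \pi_\theta still prefers y_l.
```

Read this way, the loss only asks the policy's gap to exceed the reference's gap, so a negative reference gap sets a negative target that a policy can clear without ever ranking the preferred response first.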

The authors report that CPO reaches state-of-the-art performance in experiments on standard benchmarks. Code is available at github.com/visitworld123/CPO. The theoretical analysis establishes when DPO's guarantees hold and when practitioners need a different approach.
