Conditional Equivalence of DPO and RLHF: Implicit Assumption, Failure Modes, and Provable Alignment

arXiv cs.AI·Zhiqin Yang, Yonggang Zhang, Wei Xue, Dong Fang, Bo Han, Yike Guo

5/22/2026


Quick Answer

This paper shows that Direct Preference Optimization (DPO) and Reinforcement Learning from Human Feedback (RLHF) are only conditionally equivalent: the equivalence holds when the RLHF-optimal policy itself prefers the human-preferred responses.

Quick Take

This paper shows that Direct Preference Optimization (DPO) and Reinforcement Learning from Human Feedback (RLHF) are only conditionally equivalent: the equivalence requires that the RLHF-optimal policy prefer the human-preferred responses. When this assumption fails, DPO optimizes relative advantage over the reference policy rather than absolute alignment, so a policy can lower its DPO loss while still preferring dispreferred responses. The authors introduce Constrained Preference Optimization (CPO), which augments RLHF with constraints for provable alignment, and report state-of-the-art performance on standard benchmarks.

Key Points

DPO's equivalence to RLHF relies on the assumption that the RLHF-optimal policy prefers human-preferred responses.
Failure of this assumption can lead to pathological convergence in DPO.
CPO introduces constraints to ensure provable alignment with human preferences.
Theoretical analysis clarifies when DPO guarantees hold.
CPO demonstrates state-of-the-art performance on standard benchmarks.
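
For orientation, the first two points can be read against the widely used form of the DPO loss. This is the standard formulation, not taken from this page, and the paper's own notation may differ. Here y_w is the preferred response, y_l the dispreferred one, and the margin symbol m is introduced only for this illustration.

```latex
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}})
  = -\,\mathbb{E}_{(x,\, y_w,\, y_l)}
    \left[ \log \sigma\big( \beta\, m(x, y_w, y_l) \big) \right],
\qquad
m = \log\frac{\pi_\theta(y_w \mid x)}{\pi_\theta(y_l \mid x)}
  - \log\frac{\pi_{\mathrm{ref}}(y_w \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
```

The loss falls whenever m grows, but m > 0 only says that the policy's log-odds for y_w over y_l exceed the reference's. If the reference log-odds equal -c for some c > 0, any policy log-odds in the interval (-c, 0) give m > 0 while the policy still assigns more probability to y_l than to y_w.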

[Submitted on 20 May 2026]

Abstract: Direct Preference Optimization (DPO) has emerged as a popular alternative to Reinforcement Learning from Human Feedback (RLHF), offering theoretical equivalence with simpler implementation. We prove this equivalence is conditional rather than universal, depending on an implicit assumption frequently violated in practice: the RLHF-optimal policy must prefer human-preferred responses. When this assumption fails, DPO optimizes relative advantage over the reference policy rather than absolute alignment with human preferences, leading to pathological convergence where policies decrease DPO loss while preferring dispreferred responses. We characterize when this assumption is violated, show the existence of an undesirable solution space, and prove that DPO and RLHF optimize fundamentally different objectives in such cases. To address this, we introduce Constrained Preference Optimization (CPO), augmenting RLHF with constraints for provable alignment. We further provide a geometric interpretation through soft margin ranking, revealing that DPO implements margin ranking with potentially negative targets. Our theoretical analysis establishes when DPO's guarantees hold and provides solutions preserving simplicity with provable alignment. Comprehensive experiments on standard benchmarks demonstrate that CPO achieves state-of-the-art performance. Code is available at: this https URL.
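
The failure mode described in the abstract (DPO loss decreasing while the policy still prefers the dispreferred response) can be illustrated numerically. The sketch below is not the authors' code. It assumes the standard per-pair DPO loss, and the probabilities, the beta value, and the function name dpo_loss are all hypothetical, chosen only to make the effect visible.

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard per-pair DPO loss; y_w is preferred, y_l is dispreferred."""
    # Margin is measured relative to the reference policy, not in absolute terms.
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    # -log(sigmoid(beta * margin)) written as log(1 + exp(-beta * margin)).
    return math.log1p(math.exp(-beta * margin))

# Hypothetical reference policy that strongly favors the dispreferred response.
ref_logp_w, ref_logp_l = math.log(0.05), math.log(0.60)

# Hypothetical trained policy: better than the reference on this pair,
# yet it still assigns more probability to y_l than to y_w.
logp_w, logp_l = math.log(0.20), math.log(0.40)

loss_at_ref = dpo_loss(ref_logp_w, ref_logp_l, ref_logp_w, ref_logp_l)
loss_policy = dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l)
still_prefers_loser = logp_l > logp_w

print(f"loss at reference: {loss_at_ref:.4f}")
print(f"loss after update: {loss_policy:.4f}")
print(f"policy still prefers y_l: {still_prefers_loser}")
```

In this toy setting the loss drops below its starting value of log 2 even though the policy's absolute preference still points at y_l, which is the gap between relative advantage and absolute alignment that the abstract describes. How CPO's constraints close that gap is specified in the paper and is not reproduced here.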

Comments:	49 pages
Subjects:	Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Cite as:	arXiv:2605.20834 [cs.AI]
	(or arXiv:2605.20834v1 [cs.AI] for this version)
	https://doi.org/10.48550/arXiv.2605.20834 arXiv-issued DOI via DataCite (pending registration)

Submission history

From: Zhiqin Yang
[v1] Wed, 20 May 2026 07:26:22 UTC (740 KB)

— Originally published at arxiv.org
