ξ-DPO eliminates hyperparameter tuning in preference optimization with ratio reward margins
New arXiv preprint proposes ξ-DPO, a direct preference optimization method that replaces SimPO's joint hyperparameter tuning with a single bounded margin derived from initial reward gap distributions.
ξ-DPO is a preference optimization method that simplifies model alignment by replacing SimPO's two-hyperparameter search with a single interpretable margin. The preprint, posted to arXiv on May 13, 2026, argues that jointly tuning SimPO's reward-scaling factor β and target margin γ creates unnecessary complexity across datasets with different reward structures.
The core problem: SimPO's margin formulation loses interpretability when datasets have different reward gap distributions. The authors' analysis finds that β implicitly controls sample filtering, while γ's effect depends entirely on the dataset-specific reward distribution. Practitioners are therefore forced to grid-search both parameters jointly, a costly process.
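For reference, SimPO's published objective shows why the two knobs are coupled: β scales a length-normalized log-likelihood reward, and γ is a fixed target margin subtracted inside the same sigmoid, so changing one shifts the effective role of the other. A minimal PyTorch sketch of that loss (the default values are merely typical of published settings):

```python
import torch
import torch.nn.functional as F

def simpo_loss(logp_chosen, logp_rejected, len_chosen, len_rejected,
               beta=2.0, gamma=1.0):
    """SimPO objective: length-normalized log-likelihood rewards scaled by
    beta, pushed apart by a fixed target reward margin gamma.

    logp_*: summed log-probabilities of each response under the policy.
    len_*:  response lengths in tokens (for length normalization).
    """
    reward_chosen = beta * logp_chosen / len_chosen
    reward_rejected = beta * logp_rejected / len_rejected
    # beta and gamma both shape the same margin term inside the sigmoid,
    # which is why the article says they must be grid-searched jointly.
    return -F.logsigmoid(reward_chosen - reward_rejected - gamma).mean()
```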
ξ-DPO reformulates the preference objective by redefining the reward as a ratio between the chosen and rejected responses. The ratio form cancels β's effect entirely and yields a bounded margin, denoted ξ, that can be determined directly from the initial reward gap distribution. The optimization target shifts from maximizing the likelihood of the reward gap to minimizing the distance between each pair's reward gap and the optimal margin.
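The preprint's exact loss is not reproduced in this summary, but the mechanics described above can be sketched. Assuming the per-response reward is the length-normalized log-likelihood as in SimPO, any shared scaling factor β cancels when the two rewards are divided, and because both average log-probabilities are negative, the ratio is bounded. The function name and the squared distance-to-margin penalty below are illustrative assumptions, not the authors' implementation:

```python
import torch

def xi_dpo_loss_sketch(logp_chosen, logp_rejected, len_chosen, len_rejected,
                       xi=0.5):
    """Hedged sketch of a ratio-form objective; not the verified xi-DPO loss.

    logp_*: summed log-probabilities under the policy; len_*: token counts.
    """
    avg_chosen = logp_chosen / len_chosen    # beta would multiply both terms
    avg_rejected = logp_rejected / len_rejected
    gap = avg_chosen / avg_rejected          # ...and cancel in the ratio
    # The gap sits in (0, 1) whenever the chosen response is more likely,
    # so xi is a bounded target; the loss pulls each gap toward xi rather
    # than maximizing a likelihood of the gap.
    return ((gap - xi) ** 2).mean()
```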
On the margin
Unlike SimPO's γ, ξ explicitly represents the desired relative separation between chosen and rejected responses. Because it is bounded and interpretable, practitioners can set it based on the dataset's initial reward distribution without repeated trial-and-error. The reformulation aligns the optimization objective more directly with the goal of separating preferred responses from rejected ones.
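The preprint reportedly derives ξ from the initial reward gap distribution; the estimator below is one plausible recipe under the same assumptions as the sketch above, not the authors' procedure. It scores each preference pair once with the initial policy and sets ξ from a quantile of the observed gaps (the helper name and quantile choice are hypothetical):

```python
import torch

@torch.no_grad()
def estimate_xi(initial_scores, quantile=0.25):
    """Hypothetical helper: choose the target margin xi from reward gaps
    measured under the initial (pre-training) policy.

    initial_scores: iterable of (avg_logp_chosen, avg_logp_rejected) pairs.
    """
    gaps = torch.tensor([c / r for c, r in initial_scores])
    # Smaller ratios mean better separation, so a low quantile sets an
    # ambitious but observed-in-data target; the preprint's statistic
    # may differ -- this choice is illustrative.
    return gaps.quantile(quantile).item()
```

A single scoring pass like this is cheap compared with a joint grid search over β and γ, which is the practical saving the article highlights.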
ξ-DPO retains SimPO's efficiency, needing no explicit reference model of the kind earlier RLHF approaches required, while removing the hyperparameter tuning burden. Full technical details and experimental results are available in the arXiv preprint.
