Direct Preference Optimization (DPO) has become the de facto standard for offline preference alignment of large language models, but its reliance on a reference policy introduces a critical tension. DPO weighs each update relative to a reference, which stabilizes training by keeping updates within a trusted region. This reliance becomes problematic for pessimistic pairs, where the reference model prefers the rejected response. For these pairs, DPO prematurely attenuates the gradient as soon as the policy margin ($\Delta_\theta$) merely exceeds the reference margin ($\Delta_{\mathrm{ref}}$), even if the policy is still wrong ($\Delta_\theta<0$). We name this failure premature satisfaction, a concrete form of the training-inference mismatch. Reference-free objectives remove this mismatch by optimizing the absolute margin, but at the cost of discarding the stabilizing signal of the reference. We mitigate this tension with Hybrid-DPO (HyPO), a drop-in modification to DPO that applies the reference conditionally: HyPO behaves exactly like DPO when the reference is optimistic or neutral, and it treats the reference as neutral when it is pessimistic by replacing $\Delta_\theta-\Delta_{\mathrm{ref}}$ with $\Delta_\theta-\max\{0,\Delta_{\mathrm{ref}}\}$. This one-line change strictly strengthens per-example learning signals on pessimistic pairs while preserving DPO's objective form and computational cost. By conditionally debiasing the pessimistic reference signal, HyPO mitigates premature satisfaction; empirically, across preference alignment benchmarks, HyPO improves inference-aligned metrics and achieves higher pairwise win rates. Our results provide evidence that direct preference alignment can be enhanced by conditionally debiasing the reference signal rather than discarding it.
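The one-line modification can be sketched at the level of per-example loss terms. The following is a minimal plain-Python illustration (not the paper's implementation): `dpo_loss` and `hypo_loss` are hypothetical names, the margins are taken as precomputed scalars, and `beta` is the usual DPO temperature.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dpo_loss(delta_theta, delta_ref, beta=0.1):
    # Standard DPO term: -log sigmoid(beta * (policy margin - reference margin)).
    return -math.log(sigmoid(beta * (delta_theta - delta_ref)))

def hypo_loss(delta_theta, delta_ref, beta=0.1):
    # HyPO as described above: a pessimistic reference (delta_ref < 0)
    # is treated as neutral via max{0, delta_ref}; otherwise identical to DPO.
    return -math.log(sigmoid(beta * (delta_theta - max(0.0, delta_ref))))

# Pessimistic pair: the reference prefers the rejected response (delta_ref < 0)
# and the policy is still wrong (delta_theta < 0), yet beats the reference margin.
dt, dr = -0.5, -2.0
# DPO sees a positive relative margin (dt - dr = 1.5) and attenuates the gradient;
# HyPO compares against 0 instead and keeps a stronger learning signal.
assert dpo_loss(dt, dr) < hypo_loss(dt, dr)

# Optimistic or neutral reference: the two objectives coincide exactly.
assert dpo_loss(1.0, 0.5) == hypo_loss(1.0, 0.5)
```

The clamp only ever increases the argument subtracted from $\Delta_\theta$ to at most zero, which is why the change strictly strengthens the signal on pessimistic pairs while leaving all other pairs untouched.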