Direct Preference Optimization (DPO) is an efficient alternative to reinforcement learning from human feedback (RLHF), yet it typically assumes hard binary labels and pairwise comparisons. Such assumptions can be brittle under noisy or distribution-shifted supervision. We present Anchored Direct Preference Optimization (ADPO), which (i) incorporates soft preference probabilities, (ii) aligns policy updates through reference anchoring that induces an implicit trust region, and (iii) extends to listwise learning via Plackett-Luce modeling. In controlled synthetic setups covering 12 scenarios (4 noise types × 3 severities) and 3 model scales, ADPO achieves relative improvements of 12% to 79% over a standard DPO baseline (10-seed means; 95% CIs in the Appendix). Hard labels tend to fare better under severe noise, whereas soft labels yield better calibration under distribution shift; listwise variants achieve the highest WinMass (expected probability mass on the ground-truth best item) in 9/12 scenarios. Larger models amplify ADPO's benefits (0.718 vs. 0.416 at hidden size 256), suggesting that anchoring acts as an effective trust-region regularizer. We release code and configurations to facilitate reproducibility.
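The two loss variants named above can be illustrated numerically. This is a minimal sketch, not the released implementation: function names, the choice of `beta`, and the binary-cross-entropy form of the soft-label objective are assumptions made for illustration. The key ingredients from the abstract are present: the margin is computed against frozen reference log-probabilities (the anchor that induces the implicit trust region), the pairwise target is a soft probability rather than a hard 0/1 label, and the listwise extension scores a full ranking under a Plackett-Luce model.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def soft_anchored_dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l,
                           p_soft, beta=0.1):
    """Hypothetical soft-label, reference-anchored pairwise DPO loss.

    logp_w / logp_l  : policy log-probs of the preferred / dispreferred response.
    ref_logp_*       : frozen reference-model log-probs (the anchor).
    p_soft           : soft preference probability in [0, 1];
                       p_soft = 1.0 recovers the hard-label DPO objective.
    """
    # Anchored margin: subtracting the reference log-probs penalizes drift
    # from the reference policy (the implicit trust region).
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    q = sigmoid(margin)
    # Binary cross-entropy against the soft target instead of a hard label.
    eps = 1e-12
    return -(p_soft * np.log(q + eps) + (1.0 - p_soft) * np.log(1.0 - q + eps))


def anchored_plackett_luce_loss(logps, ref_logps, ranking, beta=0.1):
    """Hypothetical listwise variant: negative log Plackett-Luce probability
    of `ranking` (best item first) under anchored scores."""
    s = beta * (np.asarray(logps) - np.asarray(ref_logps))
    loss = 0.0
    remaining = list(ranking)
    for i in ranking:
        # Log-normalizer over the items not yet placed in the ranking.
        z = np.log(np.sum(np.exp(s[remaining])))
        loss += z - s[i]
        remaining.remove(i)
    return loss
```

With `p_soft = 1.0` and a policy that already separates the pair in the reference-anchored sense, the pairwise loss is near zero; a soft `p_soft` between 0 and 1 keeps the optimum at a finite margin, which is one way to read the abstract's calibration claim under distribution shift.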