Anchored Direct Preference Optimization (ADPO) is a unified framework that generalizes Direct Preference Optimization (DPO) with soft preferences, reference-policy anchoring, and groupwise extensions. While standard DPO assumes hard binary labels and pairwise comparisons, ADPO introduces: (i) soft preference probabilities that encode uncertainty and mitigate gradient drift; (ii) arbitrary reference-policy anchors that stabilize training via groupwise shift invariance and implicit KL regularization; and (iii) listwise preference modeling through Plackett-Luce distributions. We prove that DPO, Bradley-Terry objectives, and Top-1-vs-Rest formulations emerge as special cases. ADPO yields three practical variants: pairwise anchored Soft-DPO, listwise anchored Soft-DPO with raw rewards, and KDE-based listwise smoothing for heavy-tailed noise. In contextual bandits, anchoring improves WinMass by 38-63% over standard DPO, and KDE smoothing attains a WinMass of 0.68 versus 0.32 under heavy-tailed contamination (a 112% relative gain). In sequential reinforcement learning (CartPole, LunarLander), anchoring improves performance under noisy preferences by 15-29%, confirming that the benefits transfer from single-step to multi-step settings. Experiments with 10-256-parameter models provide clear guidance: use pairwise anchored Soft-DPO under clean or moderate noise, and KDE-based listwise ADPO under extreme contamination.
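The abstract does not spell out the pairwise anchored Soft-DPO objective, so the following is only a minimal PyTorch sketch of what such a loss could look like, assuming a DPO-style log-ratio margin, a soft preference label `p_soft` in place of a hard binary label, and an arbitrary anchor policy in place of DPO's frozen reference. The function name, argument names, and the `beta` default are illustrative, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def anchored_soft_dpo_loss(logp_w, logp_l, anchor_logp_w, anchor_logp_l, p_soft, beta=0.1):
    """Illustrative pairwise anchored Soft-DPO loss (assumed form, not the paper's code).

    logp_w, logp_l           : policy log-probs of the preferred and dispreferred responses.
    anchor_logp_w, anchor_logp_l : log-probs of the same responses under the anchor policy.
    p_soft                   : soft preference probability in [0, 1] for the "preferred" item.
    beta                     : inverse-temperature scaling, as in standard DPO.
    """
    # Anchored margin: log-prob differences are taken relative to the anchor policy,
    # which makes the objective invariant to per-group additive shifts in log-probs.
    margin = beta * ((logp_w - anchor_logp_w) - (logp_l - anchor_logp_l))
    # Soft binary cross-entropy over the Bradley-Terry margin instead of a hard 0/1 label.
    return -(p_soft * F.logsigmoid(margin) + (1.0 - p_soft) * F.logsigmoid(-margin)).mean()

# Toy usage with random log-probabilities for a batch of 4 preference pairs.
logp_w, logp_l = torch.randn(4), torch.randn(4)
ref_w, ref_l = torch.randn(4), torch.randn(4)
loss = anchored_soft_dpo_loss(logp_w, logp_l, ref_w, ref_l, p_soft=torch.full((4,), 0.9))
```

Under this assumed form, setting `p_soft = 1` and taking the anchor to be DPO's frozen reference policy recovers the standard DPO loss, consistent with the abstract's claim that DPO emerges as a special case.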