Anchored Direct Preference Optimization (ADPO) is a unified framework that generalizes Direct Preference Optimization (DPO) with soft preferences, reference-policy anchoring, and groupwise extensions. While standard DPO assumes hard binary labels and pairwise comparisons, ADPO introduces: (i) soft preference probabilities that encode uncertainty and mitigate gradient drift; (ii) arbitrary reference-policy anchors that stabilize training via groupwise shift invariance and implicit KL regularization; and (iii) listwise preference modeling through Plackett-Luce distributions. We prove that DPO, Bradley-Terry objectives, and Top-1-vs-Rest formulations emerge as special cases. ADPO yields three practical variants: pairwise anchored Soft-DPO, listwise anchored Soft-DPO with raw rewards, and KDE-based listwise smoothing for heavy-tailed noise. In contextual bandits, anchoring improves WinMass by 38-63% over standard DPO, and KDE smoothing attains a WinMass of 0.68 versus 0.32 under heavy-tailed contamination (a 112% relative gain). In sequential reinforcement learning (CartPole, LunarLander), anchoring improves performance under noisy preferences by 15-29%, confirming that the benefits transfer from single-step to multi-step settings. Experiments with 10-256-parameter models provide clear guidance: use pairwise anchored Soft-DPO under clean or moderate noise, and KDE-based listwise ADPO under extreme contamination.
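The abstract does not spell out the pairwise anchored Soft-DPO objective, so the following is only a minimal PyTorch sketch of what such a loss could look like, assuming a DPO-style log-ratio margin, a soft preference label `p_soft` in place of a hard binary label, and an arbitrary anchor policy in place of DPO's frozen reference. The function name, argument names, and the `beta` default are illustrative, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def anchored_soft_dpo_loss(logp_w, logp_l, anchor_logp_w, anchor_logp_l, p_soft, beta=0.1):
    """Illustrative pairwise anchored Soft-DPO loss (assumed form, not the paper's code).

    logp_w, logp_l           : policy log-probs of the preferred and dispreferred responses.
    anchor_logp_w, anchor_logp_l : log-probs of the same responses under the anchor policy.
    p_soft                   : soft preference probability in [0, 1] for the "preferred" item.
    beta                     : inverse-temperature scaling, as in standard DPO.
    """
    # Anchored margin: log-prob differences are taken relative to the anchor policy,
    # which makes the objective invariant to per-group additive shifts in log-probs.
    margin = beta * ((logp_w - anchor_logp_w) - (logp_l - anchor_logp_l))
    # Soft binary cross-entropy over the Bradley-Terry margin instead of a hard 0/1 label.
    return -(p_soft * F.logsigmoid(margin) + (1.0 - p_soft) * F.logsigmoid(-margin)).mean()

# Toy usage with random log-probabilities for a batch of 4 preference pairs.
logp_w, logp_l = torch.randn(4), torch.randn(4)
ref_w, ref_l = torch.randn(4), torch.randn(4)
loss = anchored_soft_dpo_loss(logp_w, logp_l, ref_w, ref_l, p_soft=torch.full((4,), 0.9))
```

Under this assumed form, setting `p_soft = 1` and taking the anchor to be DPO's frozen reference policy recovers the standard DPO loss, consistent with the abstract's claim that DPO emerges as a special case.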