We present Anchored Direct Preference Optimization (ADPO), a policy alignment method derived from first principles of KL-regularized reinforcement learning. Unlike standard approaches that treat the reference policy merely as a regularizer, we show that the optimal policy in reinforcement learning from human feedback inherently operates in a differential coordinate system, optimizing relative advantage in the form of log ratios rather than absolute probabilities. ADPO explicitly parameterizes this optimal structure through anchored logits, decoupling response quality from prior popularity and creating an implicit trust region through curvature scaling. We further show that this formulation unifies supervised fine-tuning, reinforcement learning, and ranking-based objectives under a single geometric perspective. Theoretically, ADPO resolves the probability-smearing problem of supervised fine-tuning while avoiding the mode-seeking instability characteristic of reverse-KL methods. Empirically, the listwise ranking variant of ADPO achieves state-of-the-art performance on reasoning tasks, outperforming GRPO by 30.9 percent on Qwen3-1.7B and demonstrating superior robustness under distribution shift.
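To make the anchored-logit idea concrete, the following is a minimal PyTorch sketch of a listwise loss over log-ratio ("anchored") coordinates. It is an illustrative assumption, not the paper's exact objective: the function names, the Plackett-Luce ranking form, and the scale parameter `beta` (standing in for the curvature scaling described above) are all hypothetical.

```python
import torch


def anchored_logits(policy_logps: torch.Tensor, ref_logps: torch.Tensor) -> torch.Tensor:
    # Differential ("anchored") coordinates: score each candidate by its
    # log ratio log pi(y|x) - log pi_ref(y|x) rather than its absolute
    # log-probability, so quality is measured relative to the reference anchor.
    return policy_logps - ref_logps


def listwise_anchored_loss(policy_logps: torch.Tensor,
                           ref_logps: torch.Tensor,
                           ranks: torch.Tensor,
                           beta: float = 0.1) -> torch.Tensor:
    # Hypothetical listwise (Plackett-Luce style) objective over anchored logits.
    #   policy_logps, ref_logps: (B, K) sequence log-probs for K candidates per prompt
    #   ranks: (B, K) indices ordering candidates from best to worst
    #   beta: curvature scale, acting as an implicit trust region around the anchor
    z = beta * anchored_logits(policy_logps, ref_logps)   # (B, K)
    z_sorted = torch.gather(z, 1, ranks)                   # best-to-worst order
    B, K = z_sorted.shape
    nll = torch.zeros(B, device=z_sorted.device)
    for k in range(K - 1):
        # At each rank position, the chosen candidate competes against
        # all candidates that have not yet been placed.
        nll = nll - (z_sorted[:, k] - torch.logsumexp(z_sorted[:, k:], dim=1))
    return nll.mean()


if __name__ == "__main__":
    torch.manual_seed(0)
    policy_logps = torch.randn(2, 4)   # stand-ins for summed token log-probs
    ref_logps = torch.randn(2, 4)
    ranks = torch.tensor([[2, 0, 3, 1], [1, 3, 0, 2]])
    print(listwise_anchored_loss(policy_logps, ref_logps, ranks).item())
```

Because the score `z` depends only on the difference from the reference log-probability, a candidate that is merely popular under the prior gains no advantage, which is one way to read the "decoupling response quality from prior popularity" claim in the abstract.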