Direct Preference Learning has emerged as a dominant offline paradigm for preference optimization. Most of these methods are built on the Bradley-Terry (BT) model of pairwise preference ranking, which directly aligns language models with human preferences. Prior work has observed a counter-intuitive phenomenon termed likelihood displacement, in which the absolute probability of the preferred responses decreases during training. We demonstrate that such displacement can escalate into a more devastating failure mode, which we define as \textit{Catastrophic Preference Shift}, where the lost preference probability mass inadvertently shifts toward out-of-distribution (OOD) responses. This failure mode is a key limitation shared across BT-style direct preference learning methods, stemming from a fundamental conflict between unconstrained discriminative alignment and the model's foundational generative capabilities, and it ultimately leads to severe performance degradation (e.g., SimPO suffers a drop in reasoning accuracy from 73.5\% to 37.5\%). We analyze existing BT-style methods from the perspective of probability evolution and theoretically prove that they over-rely on model initialization and can induce preference shift. To resolve these counter-intuitive behaviors, we propose a theoretically grounded Stable Preference Optimization (SPO) framework that constrains preference learning within a safe alignment region. Empirical evaluations demonstrate that SPO effectively stabilizes and enhances the performance of existing BT-style preference learning methods. SPO offers new insights into the design of preference learning objectives and opens new avenues toward more reliable and interpretable language model alignment.
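For concreteness, a minimal sketch of the standard BT formulation underlying this family of methods (notation follows the common DPO convention, where $\beta$ is an inverse-temperature hyperparameter and $\pi_{\mathrm{ref}}$ a reference policy; the paper's own notation may differ):
\[
P(y_w \succ y_l \mid x) = \sigma\big(r(x, y_w) - r(x, y_l)\big), \qquad \sigma(z) = \frac{1}{1 + e^{-z}},
\]
with the reward instantiated implicitly in DPO-style methods as
\[
r(x, y) = \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}.
\]
Because this objective depends only on the \emph{difference} of rewards between the preferred response $y_w$ and the dispreferred response $y_l$, it can be reduced while the absolute likelihood of $y_w$ falls.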
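The mechanism behind the shift toward OOD responses follows from conservation of probability mass, as formalized below (an illustrative argument consistent with the abstract's claim, not the paper's full derivation): since $\sum_y \pi_\theta(y \mid x) = 1$, any joint decrease in the likelihoods of the pair must be absorbed by responses outside it,
\[
\Delta \pi_\theta(y_w \mid x) + \Delta \pi_\theta(y_l \mid x) < 0
\;\Longrightarrow\;
\sum_{y \notin \{y_w,\, y_l\}} \Delta \pi_\theta(y \mid x) > 0,
\]
and nothing in the unconstrained BT objective prevents this displaced mass from landing on OOD responses.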