Direct Preference Optimization (DPO) and its variants are increasingly used for aligning language models with human preferences. Although these methods are designed to teach a model to generate preferred responses more frequently relative to dispreferred responses, prior work has observed that the likelihood of preferred responses often decreases during training. The current work sheds light on the causes and implications of this counter-intuitive phenomenon, which we term likelihood displacement. We demonstrate that likelihood displacement can be catastrophic, shifting probability mass from preferred responses to responses with an opposite meaning. As a simple example, training a model to prefer $\texttt{No}$ over $\texttt{Never}$ can sharply increase the probability of $\texttt{Yes}$. Moreover, when aligning the model to refuse unsafe prompts, we show that such displacement can unintentionally lead to unalignment, by shifting probability mass from preferred refusal responses to harmful responses (e.g., reducing the refusal rate of Llama-3-8B-Instruct from 74.4% to 33.4%). We theoretically show that likelihood displacement is driven by preferences that induce similar embeddings, as measured by a centered hidden embedding similarity (CHES) score. Empirically, the CHES score enables identifying which training samples contribute most to likelihood displacement in a given dataset. Filtering out these samples effectively mitigated unintentional unalignment in our experiments. More broadly, our results highlight the importance of curating data with sufficiently distinct preferences, for which we believe the CHES score may prove valuable.
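The abstract does not spell out how the CHES score is computed, but the filtering idea it describes can be illustrated with a small sketch. The sketch below assumes (as a simplification, not the paper's exact definition) that CHES compares the summed hidden embeddings of the preferred and dispreferred response tokens, $\langle \sum_k h^-_k, \sum_k h^+_k \rangle - \lVert \sum_k h^+_k \rVert^2$, so that pairs whose responses induce similar embeddings receive higher scores; the function names `ches_score` and `filter_by_ches` are hypothetical.

```python
import numpy as np

def ches_score(h_pref: np.ndarray, h_dispref: np.ndarray) -> float:
    """CHES-like similarity for one preference pair (simplified sketch).

    h_pref:    (T_plus, d) hidden embeddings of the preferred response tokens.
    h_dispref: (T_minus, d) hidden embeddings of the dispreferred response tokens.
    Higher scores indicate more similar preferences, which the abstract links
    to stronger likelihood displacement.
    """
    s_pref = h_pref.sum(axis=0)        # summed preferred-token embeddings
    s_dispref = h_dispref.sum(axis=0)  # summed dispreferred-token embeddings
    # Inner product with the dispreferred sum, centered by the preferred norm:
    # identical responses score 0, dissimilar ones score negative.
    return float(s_dispref @ s_pref - s_pref @ s_pref)

def filter_by_ches(pairs, drop_frac=0.1):
    """Drop the drop_frac fraction of pairs with the highest CHES-like score,
    i.e., the samples expected to contribute most to likelihood displacement."""
    scores = [ches_score(h_p, h_d) for h_p, h_d in pairs]
    n_keep = int(len(pairs) * (1 - drop_frac))
    order = np.argsort(scores)          # ascending: most similar pairs last
    keep = sorted(order[:n_keep])       # keep the most distinct pairs, in order
    return [pairs[i] for i in keep]
```

Under this simplification, a pair whose preferred and dispreferred embeddings coincide scores higher than a pair with orthogonal embeddings, so the filter preferentially removes near-duplicate preferences, matching the abstract's recommendation to curate data with sufficiently distinct preferences.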