Direct Preference Optimization (DPO) and its variants are increasingly used for aligning language models with human preferences. Although these methods are designed to teach a model to generate preferred responses more frequently relative to dispreferred responses, prior work has observed that the likelihood of preferred responses often decreases during training. This work sheds light on the causes and implications of this counter-intuitive phenomenon, which we term likelihood displacement. We demonstrate that likelihood displacement can be catastrophic, shifting probability mass from preferred responses to responses with an opposite meaning. As a simple example, training a model to prefer $\texttt{No}$ over $\texttt{Never}$ can sharply increase the probability of $\texttt{Yes}$. Moreover, when aligning the model to refuse unsafe prompts, we show that such displacement can unintentionally lead to unalignment by shifting probability mass from preferred refusal responses to harmful responses (e.g., reducing the refusal rate of Llama-3-8B-Instruct from 74.4% to 33.4%). We theoretically characterize likelihood displacement as driven by preferences that induce similar embeddings, as measured by a centered hidden embedding similarity (CHES) score. Empirically, the CHES score enables identifying which training samples contribute most to likelihood displacement in a given dataset. Filtering out these samples effectively mitigated unintentional unalignment in our experiments. More broadly, our results highlight the importance of curating data with sufficiently distinct preferences, for which we believe the CHES score may prove valuable.
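To make the data-filtering idea concrete, below is a minimal, hedged sketch of how a CHES-like score could be used to rank preference pairs. The abstract does not define the score, so this illustration *assumes* it compares the summed token-level hidden embeddings of the preferred and dispreferred responses (the function name `ches_like_score` and the toy embeddings are hypothetical; consult the paper for the exact definition):

```python
import numpy as np

def ches_like_score(h_pref: np.ndarray, h_dispref: np.ndarray) -> float:
    """Illustrative CHES-like similarity (an assumption, not the paper's
    exact formula): compare the summed token embeddings of the dispreferred
    response against those of the preferred response, relative to the
    preferred response's own self-similarity.

    h_pref, h_dispref: (num_tokens, hidden_dim) hidden embeddings of the
    preferred and dispreferred responses for the same prompt.
    """
    s_pref = h_pref.sum(axis=0)       # aggregate preferred response -> (d,)
    s_dispref = h_dispref.sum(axis=0) # aggregate dispreferred response -> (d,)
    # Higher values = more similar preference pair = larger expected
    # likelihood displacement under this sketch's assumption.
    return float(s_dispref @ s_pref - s_pref @ s_pref)

def filter_most_similar(pairs, keep_fraction: float = 0.8):
    """Drop the preference pairs with the highest (most similar) scores,
    mimicking the filtering strategy described in the abstract."""
    scored = sorted(pairs, key=lambda p: ches_like_score(p[0], p[1]))
    return scored[: max(1, int(len(scored) * keep_fraction))]
```

Under this sketch, an identical preferred/dispreferred pair scores 0 (maximally similar), while orthogonal responses score lower, so filtering the top-scoring pairs removes the preferences with the least distinct embeddings.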