Direct preference optimization (DPO) is a successful fine-tuning strategy for aligning large language models with human preferences without the need to train a reward model or employ reinforcement learning. DPO, as originally formulated, relies on binary preference data and fine-tunes a language model to increase the likelihood of a preferred response over a dispreferred response. However, not all preference pairs are equal: in some cases the preferred response is only slightly better than the dispreferred response, while in others the preference is much stronger, for example when the dispreferred response includes harmful or toxic content. In this paper, we propose a generalization of DPO, termed DPO with an offset (ODPO), that does not treat every preference pair equally during fine-tuning. Intuitively, ODPO requires the difference between the likelihoods of the preferred and dispreferred responses to be greater than an offset value. The offset is determined based on the extent to which one response is preferred over another. Our experiments on various tasks suggest that ODPO significantly outperforms DPO in aligning language models, especially when the number of preference pairs is limited.
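To make the offset idea concrete, the following is a minimal sketch of a per-pair loss, not the paper's reference implementation. It assumes the standard DPO formulation, in which the margin is the difference of log-likelihood ratios against a frozen reference model, scaled by a temperature `beta`. The function names, the argument names, and the default `beta=0.1` are illustrative assumptions. How the offset is computed from the strength of a preference is not specified in the abstract, so it is passed in as a plain number here.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def odpo_loss(
    logp_w: float,      # policy log-likelihood of the preferred response
    logp_l: float,      # policy log-likelihood of the dispreferred response
    ref_logp_w: float,  # reference-model log-likelihood of the preferred response
    ref_logp_l: float,  # reference-model log-likelihood of the dispreferred response
    beta: float = 0.1,  # assumed temperature, as in the usual DPO formulation
    offset: float = 0.0,  # assumed to be larger for more strongly preferred pairs
) -> float:
    """Loss for one preference pair; offset=0.0 gives the plain DPO loss."""
    # Margin between the implicit rewards of the two responses.
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # The loss only becomes small once the margin exceeds the offset.
    return -log_sigmoid(margin - offset)


if __name__ == "__main__":
    pair = dict(logp_w=-12.0, logp_l=-15.0, ref_logp_w=-13.0, ref_logp_l=-14.0)
    print("DPO loss (offset 0):", odpo_loss(**pair))
    print("ODPO loss (offset 1):", odpo_loss(**pair, offset=1.0))
```

With `offset=0.0` every pair is treated alike, as in DPO. A positive offset increases the loss for the same margin, so pairs assigned a larger offset push the model to separate the two responses further.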