Direct Preference Optimization (DPO) has emerged as a prominent algorithm for the direct and robust alignment of Large Language Models (LLMs) with human preferences, offering a more straightforward alternative to complex Reinforcement Learning from Human Feedback (RLHF). Despite its promising efficacy, DPO faces a notable drawback: "verbosity", a common over-optimization phenomenon also observed in RLHF. While previous studies mainly attributed verbosity to biased labels within the data, we propose that the issue also stems from an inherent algorithmic length reliance in DPO. Specifically, we suggest that the discrepancy between the sequence-level Kullback-Leibler (KL) divergences of the chosen and rejected sequences used in DPO results in overestimated or underestimated rewards due to differing token lengths. Empirically, we utilize datasets with different label lengths to demonstrate the presence of biased rewards. We then introduce an effective downsampling approach, named SamPO, to eliminate the potential length reliance. Our experimental evaluations, conducted across three LLMs of varying scales and a diverse array of conditional and open-ended benchmarks, highlight the efficacy of SamPO in mitigating verbosity, achieving improvements of 5% to 12% over DPO through debiased rewards. Our code is available at: https://github.com/LuJunru/SamPO/.
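The sketch below illustrates the downsampling idea described above; it is not the authors' implementation (see the linked repository for that). It assumes per-token log-probabilities and validity masks are already available, and the function name `sampo_loss`, the tensor shapes, and the `beta` default are hypothetical choices for illustration.

```python
import torch
import torch.nn.functional as F


def sampo_loss(policy_chosen_logps, policy_rejected_logps,
               ref_chosen_logps, ref_rejected_logps,
               chosen_mask, rejected_mask, beta=0.1):
    """Illustrative SamPO-style loss (hypothetical sketch, not the official code).

    *_logps: per-token log-probabilities, shape (batch, seq_len)
    *_mask:  1.0 for valid response tokens, 0.0 for prompt/padding tokens
    """
    losses = []
    for i in range(policy_chosen_logps.size(0)):
        # Per-token log-ratios (policy vs. reference) over valid response tokens.
        chosen_ratio = (policy_chosen_logps[i] - ref_chosen_logps[i])[chosen_mask[i].bool()]
        rejected_ratio = (policy_rejected_logps[i] - ref_rejected_logps[i])[rejected_mask[i].bool()]

        # Downsample both responses to the same number of tokens so the
        # implicit reward no longer depends on the raw length difference.
        k = min(chosen_ratio.numel(), rejected_ratio.numel())
        chosen_idx = torch.randperm(chosen_ratio.numel())[:k]
        rejected_idx = torch.randperm(rejected_ratio.numel())[:k]

        chosen_reward = beta * chosen_ratio[chosen_idx].sum()
        rejected_reward = beta * rejected_ratio[rejected_idx].sum()

        # Standard DPO objective applied to the length-debiased rewards.
        losses.append(-F.logsigmoid(chosen_reward - rejected_reward))
    return torch.stack(losses).mean()
```

Compared with vanilla DPO, which sums log-ratios over all tokens of each response, this sketch equalizes the number of contributing tokens before forming the reward margin, which is the mechanism the abstract credits for mitigating verbosity.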