Direct Preference Optimization (DPO) has emerged as a prominent algorithm for the direct and robust alignment of Large Language Models (LLMs) with human preferences, offering a more straightforward alternative to the complex Reinforcement Learning from Human Feedback (RLHF). Despite its promising efficacy, DPO faces a notable drawback: "verbosity", a common over-optimization phenomenon also observed in RLHF. While previous studies mainly attributed verbosity to biased labels within the data, we propose that the issue also stems from an inherent algorithmic length reliance in DPO. Specifically, we suggest that the discrepancy between the sequence-level Kullback-Leibler (KL) divergences of the chosen and rejected sequences, as used in DPO, results in overestimated or underestimated rewards due to varying token lengths. Empirically, we utilize datasets with different label lengths to demonstrate the presence of biased rewards. We then introduce an effective downsampling approach, named SamPO, to eliminate the potential length reliance. Our experimental evaluations, conducted across three LLMs of varying scales and a diverse array of conditional and open-ended benchmarks, highlight the efficacy of SamPO in mitigating verbosity, achieving improvements of 5% to 12% over DPO through debiased rewards. Our code can be accessed at: https://github.com/LuJunru/SamPO/.
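The length reliance described above can be illustrated with a minimal sketch. In standard DPO, the implicit reward for each sequence is a sum of per-token log-probability ratios between the policy and the reference model, so a longer sequence contributes more summed terms; the downsampling idea is to draw an equal number of per-token ratios from both sequences (here, the length of the shorter one) before summing. Note this is a simplified illustration of the general idea, not the paper's exact implementation; the function names and the uniform-sampling choice are assumptions for demonstration.

```python
import random

def dpo_logit(policy_chosen, ref_chosen, policy_rejected, ref_rejected, beta=0.1):
    """Standard DPO logit: difference of sequence-level log-ratio sums.
    A longer sequence contributes more summed terms, which can inflate
    or deflate its implicit reward purely due to token length."""
    chosen = sum(p - r for p, r in zip(policy_chosen, ref_chosen))
    rejected = sum(p - r for p, r in zip(policy_rejected, ref_rejected))
    return beta * (chosen - rejected)

def sampo_logit(policy_chosen, ref_chosen, policy_rejected, ref_rejected,
                beta=0.1, rng=None):
    """SamPO-style sketch (simplified): downsample the per-token
    log-ratios of both sequences to the shorter length before summing,
    so both sides of the logit cover an equal number of tokens."""
    rng = rng or random.Random(0)
    k = min(len(policy_chosen), len(policy_rejected))
    chosen_ratios = [p - r for p, r in zip(policy_chosen, ref_chosen)]
    rejected_ratios = [p - r for p, r in zip(policy_rejected, ref_rejected)]
    chosen = sum(rng.sample(chosen_ratios, k))
    rejected = sum(rng.sample(rejected_ratios, k))
    return beta * (chosen - rejected)
```

For instance, if every token carries an identical log-ratio of 0.1, a 10-token chosen response beats a 4-token rejected response under plain DPO purely because of its length, while the downsampled variant scores them equally.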