Preference alignment in Large Language Models (LLMs) has significantly improved their ability to adhere to human instructions and intentions. However, existing direct alignment algorithms primarily focus on relative preferences and often overlook the qualitative aspects of responses. Striving to maximize the implicit reward gap between the chosen and the only slightly inferior rejected responses can cause overfitting and unnecessary unlearning of high-quality rejected responses. Unawareness of reward scores also drives the LLM to indiscriminately favor low-quality chosen responses and to fail to generalize to the highest-reward responses, which are sparse in the data. To overcome these shortcomings, our study introduces reward-conditioned LLM policies that discern and learn from the entire spectrum of response quality within the dataset, helping extrapolate to more optimal regions. We propose a simple yet effective data relabeling method that conditions the preference pairs on quality scores to construct a reward-augmented dataset. This dataset is easily integrated with existing direct alignment algorithms and is applicable to any preference dataset. Experimental results across instruction-following benchmarks including AlpacaEval, MT-Bench, and Arena-Hard-Auto demonstrate that our approach consistently boosts the performance of DPO by a considerable margin across diverse models. Additionally, our method improves the average accuracy on various academic benchmarks. When applied to on-policy data, the resulting DPO model achieves SOTA results on AlpacaEval. Through ablation studies, we show that our method not only maximizes the utility of preference data but also mitigates the issue of unlearning, demonstrating its broad effectiveness beyond mere dataset expansion. Our code is available at https://github.com/shenao-zhang/reward-augmented-preference.
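The relabeling idea described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' exact implementation: the prompt template, field names, and the choice to emit one reward-conditioned example per response are all hypothetical; the paper's actual conditioning format may differ.

```python
def reward_augment(pair):
    """Relabel one preference pair into reward-conditioned examples.

    `pair` is assumed to hold a prompt, the chosen/rejected responses,
    and a quality score for each response (e.g. from a reward model).
    Conditioning on the score lets the policy learn from the full
    quality spectrum instead of only the relative preference.
    """
    def conditioned(prompt, score):
        # Hypothetical template: prepend the desired quality score.
        return f"[Desired quality: {score}] {prompt}"

    # Conditioned on the chosen response's (higher) score, the chosen
    # response remains the preferred target.
    high = {
        "prompt": conditioned(pair["prompt"], pair["chosen_score"]),
        "chosen": pair["chosen"],
        "rejected": pair["rejected"],
    }
    # Conditioned on the rejected response's (lower) score, the rejected
    # response becomes the target, so a high-quality "rejected" answer
    # is learned rather than unlearned.
    low = {
        "prompt": conditioned(pair["prompt"], pair["rejected_score"]),
        "chosen": pair["rejected"],
        "rejected": pair["chosen"],
    }
    return [high, low]
```

Each original pair thus yields two reward-conditioned pairs that any direct alignment algorithm (e.g. DPO) can consume unchanged; at inference time one would condition on the highest reward to steer generation toward the optimal region.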