Recently, tremendous strides have been made in aligning the outputs of Large Language Models (LLMs) with human values to mitigate toxic or unhelpful content. Reinforcement Learning from Human Feedback (RLHF) has proven effective and is widely adopted by researchers. However, RLHF is complex to implement, and its sensitivity to hyperparameters makes stable performance and scalability hard to achieve. Furthermore, prevailing approaches to preference alignment concentrate primarily on pairwise comparisons, with limited exploration of multi-response scenarios, and thereby overlook the potential richness of the candidate pool. To address these issues, we propose Listwise Reward Enhancement for Preference Alignment (LIRE), a gradient-based reward optimization approach that incorporates the offline rewards of multiple responses into a streamlined listwise framework, thus eliminating the need for online sampling during training. LIRE is straightforward to implement, requires minimal parameter tuning, aligns seamlessly with the pairwise paradigm, and extends naturally to multi-response scenarios. Moreover, we introduce a self-enhancement algorithm that iteratively refines the reward during training. Our experiments demonstrate that LIRE consistently outperforms existing methods across several benchmarks on dialogue and summarization tasks and transfers well to out-of-distribution data, as assessed by both proxy reward models and human annotators.
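To make the idea of a listwise objective over offline rewards concrete, the following is a minimal sketch and not necessarily the exact LIRE loss, which the abstract does not spell out. It assumes that each candidate response already has a precomputed (offline) scalar reward and a sequence log-probability under the current policy, that the log-probabilities are turned into a distribution over the candidate list with a temperature-scaled softmax, and that the loss is the negative expected reward under that distribution. The function name `listwise_reward_loss` and the `temperature` parameter are hypothetical.

```python
import math


def listwise_reward_loss(seq_logprobs, rewards, temperature=1.0):
    """Negative expected reward over a list of candidate responses.

    seq_logprobs: policy log-probability of each candidate (assumed given).
    rewards: offline reward of each candidate (assumed precomputed).
    temperature: softmax temperature (hypothetical hyperparameter).
    Returns (loss, probs), where probs is the softmax over the candidates.
    """
    if not seq_logprobs or len(seq_logprobs) != len(rewards):
        raise ValueError("need one reward per candidate, and at least one candidate")
    if temperature <= 0:
        raise ValueError("temperature must be positive")

    # Temperature-scaled log-probabilities, shifted by the max for numerical stability.
    scaled = [lp / temperature for lp in seq_logprobs]
    shift = max(scaled)
    weights = [math.exp(s - shift) for s in scaled]
    total = sum(weights)
    probs = [w / total for w in weights]

    # Expected reward under the listwise distribution; minimizing the loss
    # moves probability mass toward higher-reward candidates.
    expected_reward = sum(p * r for p, r in zip(probs, rewards))
    return -expected_reward, probs


# Two candidates recover a pairwise setting; longer lists use the same code path.
pair_loss, pair_probs = listwise_reward_loss([-1.0, -1.0], [1.0, 0.0])
list_loss, list_probs = listwise_reward_loss([-1.0, -2.0, -3.0], [0.9, 0.1, 0.2])
```

In this sketch no responses are sampled from the policy during training: the candidate list and its rewards are fixed inputs, which is the sense in which the rewards are offline. In practice the same computation would be written with a differentiable tensor library so that gradients flow into the policy's log-probabilities.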