Secrets of RLHF in Large Language Models Part II: Reward Modeling

Binghai Wang, Rui Zheng, Lu Chen, Yan Liu, Shihan Dou, Caishuang Huang, Wei Shen, Senjie Jin, Enyu Zhou, Chenyu Shi, Songyang Gao, Nuo Xu, Yuhao Zhou, Xiaoran Fan, Zhiheng Xi, Jun Zhao, Xiao Wang, Tao Ji, Hang Yan, Lixing Shen, Zhan Chen, Tao Gui, Qi Zhang, Xipeng Qiu, Xuanjing Huang, Zuxuan Wu, Yu-Gang Jiang

Reinforcement Learning from Human Feedback (RLHF) has become a crucial technology for aligning language models with human values and intentions, enabling models to produce more helpful and harmless responses. Reward models are trained as proxies for human preferences to drive reinforcement learning optimization. While reward models are often considered central to achieving high performance, they face the following challenges in practical applications: (1) Incorrect and ambiguous preference pairs in the dataset may hinder the reward model from accurately capturing human intent. (2) Reward models trained on data from a specific distribution often struggle to generalize to examples outside that distribution and are not suitable for iterative RLHF training. In this report, we attempt to address these two issues. (1) From a data perspective, we propose a method to measure the strength of preferences within the data, based on a voting mechanism of multiple reward models. Experimental results confirm that data with varying preference strengths have different impacts on reward model performance. We introduce a series of novel methods to mitigate the influence of incorrect and ambiguous preferences in the dataset and fully leverage high-quality preference data. (2) From an algorithmic standpoint, we introduce contrastive learning to enhance the ability of reward models to distinguish between chosen and rejected responses, thereby improving model generalization. Furthermore, we employ meta-learning to enable the reward model to maintain the ability to differentiate subtle differences in out-of-distribution samples, and this approach can be utilized for iterative RLHF optimization.
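
As an illustration of the voting idea in point (1), the following is a minimal sketch, not the authors' implementation. It assumes that each reward model in an ensemble is a callable mapping a (prompt, response) pair to a scalar score, and that preference strength can be summarized by the mean and standard deviation of the chosen-minus-rejected reward margin across the ensemble. The names `preference_strength` and `agreement_rate`, and the toy models in the usage example, are hypothetical.

```python
from statistics import mean, pstdev
from typing import Callable, Sequence, Tuple

# Assumption: a reward model is any callable (prompt, response) -> scalar score.
RewardModel = Callable[[str, str], float]


def preference_strength(
    models: Sequence[RewardModel], prompt: str, chosen: str, rejected: str
) -> Tuple[float, float]:
    """Mean and population std of the reward margin across the ensemble.

    A large positive mean suggests a strong, consistent preference; a mean
    near zero suggests an ambiguous pair; a negative mean suggests the
    label may be incorrect.
    """
    margins = [m(prompt, chosen) - m(prompt, rejected) for m in models]
    return mean(margins), pstdev(margins)


def agreement_rate(
    models: Sequence[RewardModel], prompt: str, chosen: str, rejected: str
) -> float:
    """Fraction of ensemble members that score `chosen` above `rejected`."""
    votes = [m(prompt, chosen) > m(prompt, rejected) for m in models]
    return sum(votes) / len(votes)


if __name__ == "__main__":
    # Toy stand-ins for trained reward models (illustrative only).
    toy_models = [
        lambda p, r: float(len(r)),
        lambda p, r: 2.0 * len(r),
        lambda p, r: -1.0 * len(r),
    ]
    mu, sigma = preference_strength(toy_models, "q", "abcd", "ab")
    rate = agreement_rate(toy_models, "q", "abcd", "ab")
    print(f"mean margin={mu:.3f}, std={sigma:.3f}, agreement={rate:.2f}")
```

Under these assumptions, pairs with a negative or near-zero mean margin would be candidates for the incorrect or ambiguous preferences that the abstract mentions. How such pairs are then treated is not specified here.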

