Information-Theoretic Reward Decomposition for Generalizable RLHF

A generalizable reward model is crucial in Reinforcement Learning from Human Feedback (RLHF) because it enables correct evaluation of unseen prompt-response pairs. However, existing reward models lack this ability: they are typically trained by increasing the reward gap between chosen and rejected responses, while overlooking the prompts that the responses are conditioned on. Consequently, when the trained reward model is evaluated on prompt-response pairs that lie outside the training data distribution, neglecting the effect of prompts may lead to poor generalization. To address this issue, we decompose the reward value into two independent components: a prompt-free reward and a prompt-related reward. The prompt-free reward represents the evaluation that is determined only by the response, while the prompt-related reward reflects the reward that derives from both the prompt and the response. We extract these two components from an information-theoretic perspective, which requires no extra models. We then propose a new reward learning algorithm that prioritizes data samples based on their prompt-free reward values. Through toy examples, we demonstrate that the extracted prompt-free and prompt-related rewards effectively characterize two parts of the reward model. Further, standard evaluations show that our method improves both the alignment performance and the generalization capability of the reward model.
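To make the decomposition idea concrete, the following is a minimal sketch and not the paper's estimator. It assumes the prompt-free part of a reward gap can be approximated by averaging the gap of a (chosen, rejected) pair over a pool of prompts, with the remainder treated as the prompt-related part. The names `toy_reward`, `prompt_free_gap`, and `decompose`, the toy data, and the choice to rank samples by ascending absolute prompt-free gap are all hypothetical illustrations; the abstract does not specify the extraction procedure or the direction of prioritization.

```python
import math

def bt_loss(gap):
    """Bradley-Terry negative log-likelihood, -log(sigmoid(gap)), for a reward gap."""
    # Numerically stable form of log(1 + exp(-gap)).
    if gap >= 0:
        return math.log1p(math.exp(-gap))
    return -gap + math.log1p(math.exp(gap))

def toy_reward(prompt, response):
    """Hypothetical reward: a length bonus (prompt-free) plus a relevance bonus."""
    length_bonus = 0.5 * len(response.split())
    relevance = 2.0 if prompt in response else 0.0
    return length_bonus + relevance

def prompt_free_gap(reward_fn, prompts, chosen, rejected):
    """Assumed approximation: average the reward gap over a pool of prompts."""
    gaps = [reward_fn(p, chosen) - reward_fn(p, rejected) for p in prompts]
    return sum(gaps) / len(gaps)

def decompose(reward_fn, prompts, sample):
    """Split the reward gap of one sample into prompt-free and prompt-related parts."""
    prompt, chosen, rejected = sample
    total = reward_fn(prompt, chosen) - reward_fn(prompt, rejected)
    free = prompt_free_gap(reward_fn, prompts, chosen, rejected)
    return {"total": total, "prompt_free": free, "prompt_related": total - free}

# Toy preference data: (prompt, chosen response, rejected response).
prompts = ["cats", "dogs", "birds"]
samples = [
    ("birds", "birds sing a lovely song at dawn", "no"),
    ("cats", "cats purr", "dogs bark loudly today"),
]

# Assumed prioritization rule: samples whose preference is least explained by
# the response alone (smallest absolute prompt-free gap) come first.
ranked = sorted(
    samples,
    key=lambda s: abs(decompose(toy_reward, prompts, s)["prompt_free"]),
)

for s in ranked:
    parts = decompose(toy_reward, prompts, s)
    print(s[0], parts, "BT loss:", round(bt_loss(parts["total"]), 4))
```

In this toy setup the "cats" pair is preferred only because the chosen response matches the prompt, so its prompt-free gap is negative while its prompt-related gap is positive. The "birds" pair is preferred largely because the chosen response is longer, regardless of prompt.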

