Techniques that learn improved representations via offline data or self-supervised objectives have shown impressive results in traditional reinforcement learning (RL). Nevertheless, it is unclear how improved representation learning can benefit reinforcement learning from human feedback (RLHF) on language models (LMs). In this work, we propose training reward models (RMs) in a contrastive, $\textit{goal-conditioned}$ fashion by increasing the representation similarity of future states along sampled preferred trajectories and decreasing the similarity along randomly sampled dispreferred trajectories. This objective significantly improves RM performance, by up to 0.09 AUROC, on challenging benchmarks such as MATH and GSM8k. These findings extend to general alignment as well -- on the Helpful-Harmless dataset, we observe a $2.3\%$ increase in accuracy. Beyond improving reward model performance, we show that training RM representations in this way enables improved $\textit{steerability}$ because it allows us to evaluate the likelihood of an action achieving a particular goal-state (e.g., whether a solution is correct or helpful). Leveraging this insight, we find that we can filter up to $55\%$ of generated tokens during majority voting by discarding trajectories likely to end up in an "incorrect" state, which leads to significant cost savings. We additionally find that these representations enable fine-grained control by conditioning on desired future goal-states. For example, we show that steering a Llama 3 model towards helpful generations with our approach improves helpfulness by $9.6\%$ over a supervised fine-tuned baseline. Similarly, steering the model towards complex generations improves complexity by $21.6\%$ over the baseline. Overall, we find that training RMs in this contrastive, goal-conditioned fashion significantly improves performance and enables model steerability.
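The abstract describes the training objective only at a high level. Below is a minimal sketch of one way such a contrastive, goal-conditioned loss could be written; it is an illustrative assumption rather than the paper's exact implementation, and the tensor names (`state_repr`, `goal_repr`, `labels`), the in-batch negative sampling, and the InfoNCE-style formulation are ours.

```python
import torch
import torch.nn.functional as F

def goal_conditioned_contrastive_loss(state_repr: torch.Tensor,
                                       goal_repr: torch.Tensor,
                                       labels: torch.Tensor,
                                       temperature: float = 0.1) -> torch.Tensor:
    """Illustrative sketch of a contrastive, goal-conditioned objective.

    state_repr: (B, D) representations of intermediate states (partial generations).
    goal_repr:  (B, D) representations of future/terminal states of the same trajectories.
    labels:     (B,) 1 if the trajectory is preferred, 0 if dispreferred.
    """
    # Normalize so dot products act as cosine similarities.
    state_repr = F.normalize(state_repr, dim=-1)
    goal_repr = F.normalize(goal_repr, dim=-1)

    # Similarity between every intermediate state and every goal state in the batch.
    sims = state_repr @ goal_repr.T / temperature  # (B, B)

    # Pull each intermediate state toward its own trajectory's future state (diagonal)
    # and push it away from randomly paired goal states in the batch (off-diagonal).
    targets = torch.arange(sims.size(0), device=sims.device)
    per_example_nce = F.cross_entropy(sims, targets, reduction="none")

    # Only preferred trajectories contribute a positive (state, future-state) pair;
    # dispreferred trajectories act purely as negatives for other examples.
    return (per_example_nce * labels.float()).mean()
```

In this sketch, the learned similarity between a partial generation and a candidate goal-state representation can then be read off directly (e.g., to discard trajectories unlikely to reach a "correct" state during majority voting), which is the steerability use case the abstract describes.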