Techniques that learn improved representations via offline data or self-supervised objectives have shown impressive results in traditional reinforcement learning (RL). Nevertheless, it is unclear how improved representation learning can benefit reinforcement learning from human feedback (RLHF) on language models (LMs). In this work, we propose training reward models (RMs) in a contrastive, $\textit{goal-conditioned}$ fashion by increasing the representation similarity of future states along sampled preferred trajectories and decreasing the similarity along randomly sampled dispreferred trajectories. This objective significantly improves RM performance, by up to 0.09 AUROC, on challenging benchmarks such as MATH and GSM8k. These findings extend to general alignment as well -- on the Helpful-Harmless dataset, we observe a $2.3\%$ increase in accuracy. Beyond improving reward model performance, we show that training RM representations in this way enables improved $\textit{steerability}$ because it allows us to evaluate the likelihood of an action achieving a particular goal-state (e.g., whether a solution is correct or helpful). Leveraging this insight, we find that we can filter up to $55\%$ of generated tokens during majority voting by discarding trajectories likely to end up in an "incorrect" state, which leads to significant cost savings. We additionally find that these representations enable fine-grained control by conditioning on desired future goal-states. For example, we show that steering a Llama 3 model towards helpful generations with our approach improves helpfulness by $9.6\%$ over a supervised fine-tuned baseline. Similarly, steering the model towards complex generations improves complexity by $21.6\%$ over the baseline. Overall, we find that training RMs in this contrastive, goal-conditioned fashion significantly improves performance and enables model steerability.
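The abstract describes the training objective only at a high level. Below is a minimal sketch of one way such a contrastive, goal-conditioned loss could be written; it is an illustrative assumption rather than the paper's exact implementation, and the tensor names (`state_repr`, `goal_repr`, `labels`), the in-batch negative sampling, and the InfoNCE-style formulation are ours.

```python
import torch
import torch.nn.functional as F

def goal_conditioned_contrastive_loss(state_repr: torch.Tensor,
                                       goal_repr: torch.Tensor,
                                       labels: torch.Tensor,
                                       temperature: float = 0.1) -> torch.Tensor:
    """Illustrative sketch of a contrastive, goal-conditioned objective.

    state_repr: (B, D) representations of intermediate states (partial generations).
    goal_repr:  (B, D) representations of future/terminal states of the same trajectories.
    labels:     (B,) 1 if the trajectory is preferred, 0 if dispreferred.
    """
    # Normalize so dot products act as cosine similarities.
    state_repr = F.normalize(state_repr, dim=-1)
    goal_repr = F.normalize(goal_repr, dim=-1)

    # Similarity between every intermediate state and every goal state in the batch.
    sims = state_repr @ goal_repr.T / temperature  # (B, B)

    # Pull each intermediate state toward its own trajectory's future state (diagonal)
    # and push it away from randomly paired goal states in the batch (off-diagonal).
    targets = torch.arange(sims.size(0), device=sims.device)
    per_example_nce = F.cross_entropy(sims, targets, reduction="none")

    # Only preferred trajectories contribute a positive (state, future-state) pair;
    # dispreferred trajectories act purely as negatives for other examples.
    return (per_example_nce * labels.float()).mean()
```

In this sketch, the learned similarity between a partial generation and a candidate goal-state representation can then be read off directly (e.g., to discard trajectories unlikely to reach a "correct" state during majority voting), which is the steerability use case the abstract describes.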