RewardBench: Evaluating Reward Models for Language Modeling

Nathan Lambert, Valentina Pyatkin, Jacob Morrison, LJ Miranda, Bill Yuchen Lin, Khyathi Chandu, Nouha Dziri, Sachin Kumar, Tom Zick, Yejin Choi, Noah A. Smith, Hannaneh Hajishirzi

From arXiv: 40 pages, 19 figures, 12 tables

Reward models (RMs) are at the crux of successful RLHF to align pretrained models to human preferences, yet relatively little work has focused on evaluating those reward models. Evaluating reward models presents an opportunity to understand the opaque technologies used for alignment of language models and which values are embedded in them. To date, very few descriptors of capabilities, training methods, or open-source reward models exist. In this paper, we present RewardBench, a benchmark dataset and codebase for evaluation, to enhance scientific understanding of reward models. The RewardBench dataset is a collection of prompt-win-lose trios spanning chat, reasoning, and safety, used to benchmark how reward models perform on challenging, structured, and out-of-distribution queries. We created specific comparison datasets for RMs that have subtle but verifiable reasons (e.g., bugs, incorrect facts) why one answer should be preferred to another. On the RewardBench leaderboard, we evaluate reward models trained with a variety of methods, such as the direct MLE training of classifiers and the implicit reward modeling of Direct Preference Optimization (DPO), and on a spectrum of datasets. We present many findings on the propensity for refusals, reasoning limitations, and instruction-following shortcomings of various reward models, working towards a better understanding of the RLHF process.
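
As a rough illustration of how a prompt-win-lose trio could be scored, the sketch below counts a reward model as correct on a trio when it assigns the "win" completion a strictly higher reward than the "lose" completion, and reports the fraction of trios it gets right. It also shows the implicit reward commonly associated with DPO, beta times the log-probability ratio between the policy and a reference model. The names `Trio`, `pairwise_accuracy`, and `dpo_implicit_reward`, the tie-handling rule, and the default `beta` are assumptions made for this example and are not taken from the RewardBench codebase.

```python
from typing import Callable, Iterable, NamedTuple


class Trio(NamedTuple):
    """One benchmark item (hypothetical structure for this sketch)."""
    prompt: str
    win: str   # completion that should be preferred
    lose: str  # completion that should be dispreferred


def pairwise_accuracy(
    trios: Iterable[Trio],
    reward: Callable[[str, str], float],
) -> float:
    """Fraction of trios where reward(prompt, win) > reward(prompt, lose).

    Ties are counted as incorrect here; that is an assumption of this
    sketch, not a documented benchmark rule.
    """
    items = list(trios)
    if not items:
        raise ValueError("need at least one trio")
    correct = sum(
        reward(t.prompt, t.win) > reward(t.prompt, t.lose) for t in items
    )
    return correct / len(items)


def dpo_implicit_reward(
    policy_logprob: float,
    ref_logprob: float,
    beta: float = 0.1,  # illustrative default, not a value from the paper
) -> float:
    """Implicit DPO reward: beta * (log pi(y|x) - log pi_ref(y|x)).

    Both arguments are summed token log-probabilities of the same
    completion y given the same prompt x.
    """
    return beta * (policy_logprob - ref_logprob)
```

Under these assumptions, any scoring function with the signature `(prompt, completion) -> float` can be passed as `reward`, so a classifier-style reward model and a DPO-trained model (wrapped to return `dpo_implicit_reward`) could be compared with the same accuracy routine.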

