Reward models (RMs) are at the crux of successful RLHF for aligning pretrained models to human preferences, yet relatively little work has focused on evaluating them. Evaluating reward models presents an opportunity to understand the opaque technologies used to align language models and the values embedded in them. To date, few descriptions of reward model capabilities or training methods exist, and few reward models are open source. In this paper, we present RewardBench, a benchmark dataset and code base for evaluation, to enhance scientific understanding of reward models. The RewardBench dataset is a collection of prompt-win-lose trios spanning chat, reasoning, and safety, used to benchmark how reward models perform on challenging, structured, and out-of-distribution queries. We created specific comparison datasets for RMs in which there are subtle but verifiable reasons (e.g., bugs, incorrect facts) why one answer should be preferred over another. On the RewardBench leaderboard, we evaluate reward models trained with a variety of methods, such as direct MLE training of classifiers and the implicit reward modeling of Direct Preference Optimization (DPO), and on a spectrum of datasets. We present many findings on the propensity for refusals, reasoning limitations, and instruction-following shortcomings of various reward models, toward a better understanding of the RLHF process.
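To make the prompt-win-lose setup concrete, the following is a minimal sketch of how a reward model could be scored on such trios. It is not the RewardBench code base. It assumes a trio counts as correct when the model assigns a strictly higher score to the win response than to the lose response, which the abstract does not state explicitly. The names `Trio`, `pairwise_accuracy`, `dpo_implicit_reward`, and `toy_length_reward` are hypothetical, and the toy scorer stands in for a real reward model. For DPO-trained models, the sketch uses the standard implicit reward, beta times the log-probability ratio between the policy and a reference model; the prompt-dependent term is dropped because it cancels when two responses to the same prompt are compared.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class Trio:
    """One prompt-win-lose example (field names are hypothetical)."""
    prompt: str
    win: str
    lose: str


def pairwise_accuracy(
    trios: Iterable[Trio],
    reward_fn: Callable[[str, str], float],
) -> float:
    """Fraction of trios where the win response outscores the lose response.

    Assumption: ties count as incorrect (strict inequality).
    """
    results = [
        reward_fn(t.prompt, t.win) > reward_fn(t.prompt, t.lose)
        for t in trios
    ]
    if not results:
        raise ValueError("need at least one trio")
    return sum(results) / len(results)


def dpo_implicit_reward(
    policy_logprob: float,
    ref_logprob: float,
    beta: float = 0.1,
) -> float:
    """Implicit DPO reward: beta * (log pi(y|x) - log pi_ref(y|x)).

    Inputs are summed token log-probabilities of the response under the
    policy and the reference model. The default beta is an assumption.
    """
    return beta * (policy_logprob - ref_logprob)


def toy_length_reward(prompt: str, response: str) -> float:
    """Placeholder scorer that prefers shorter responses.

    Stands in for a real reward model; for illustration only.
    """
    return -float(len(response))


if __name__ == "__main__":
    examples = [
        Trio("What is 2 + 2?", "4", "The answer is 5."),
        Trio("Name a prime.", "7 is a prime number.", "9"),
    ]
    print(pairwise_accuracy(examples, toy_length_reward))
```

A classifier-style reward model would supply `reward_fn` directly from its scalar output, while a DPO model would supply a function that computes the two log-probabilities and passes them through `dpo_implicit_reward`.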