Reward models are key to reinforcement learning from human feedback (RLHF) systems, aligning model behavior with human preferences. In the mathematical domain in particular, many studies have used reward models to align policies and improve reasoning capabilities. Recently, as the importance of reward models has grown, RewardBench was proposed to better understand their behavior. However, we find that the math subset of RewardBench uses different representations for chosen and rejected completions and relies on a single comparison, which may yield unreliable results since it considers only an isolated case. It therefore fails to accurately reflect the robustness of reward models, leading to a misunderstanding of their performance and potentially resulting in reward hacking. In this work, we introduce a new design for the reliable evaluation of reward models and, to validate it, construct RewardMATH, a benchmark that effectively represents the robustness of reward models on mathematical reasoning tasks. We demonstrate that scores on RewardMATH strongly correlate with the results of the optimized policy and effectively estimate reward overoptimization, whereas the existing benchmark shows almost no correlation. These results underscore the potential of our design to enhance the reliability of evaluation and to represent the robustness of reward models. We make our code and data publicly available.