Reward models are key to reinforcement learning from human feedback (RLHF) systems, aligning model behavior with human preferences. In the mathematical domain in particular, many studies have used reward models to align policies and improve reasoning capabilities. Recently, as the importance of reward models has grown, RewardBench was proposed to better understand their behavior. However, we find that the math subset of RewardBench uses different representations for chosen and rejected completions and relies on a single comparison, which may yield unreliable results since it considers only an isolated case. It therefore fails to accurately reflect the robustness of reward models, leading to a misunderstanding of their performance and potentially resulting in reward hacking. In this work, we introduce a new design for the reliable evaluation of reward models and, to validate it, construct RewardMATH, a benchmark that effectively represents the robustness of reward models on mathematical reasoning tasks. We demonstrate that scores on RewardMATH strongly correlate with the results of the optimized policy and effectively estimate reward overoptimization, whereas the existing benchmark shows almost no correlation. These results underscore the potential of our design to enhance the reliability of evaluation and to represent the robustness of reward models. We make our code and data publicly available.