Reward models (RMs) play a crucial role in Reinforcement Learning from Human Feedback (RLHF), serving as proxies for human preferences when aligning large language models. However, they suffer from various biases that can lead to reward hacking. In this paper, we identify a model preference bias in RMs: they systematically assign disproportionately high scores to responses from certain policy models, leading to unfair judgments. To mitigate this bias, we propose a calibration method, CHatbot Arena calibrated Reward Modeling (CHARM), which leverages Elo scores from the Chatbot Arena to construct debiased preference datasets and to adjust reward model scoring. We conduct extensive experiments on reward model benchmarks and human preference alignment. The results show that our calibrated RMs achieve improved evaluation accuracy on RM-Bench and the Chat-Hard domain of RewardBench, exhibit a stronger correlation with human preferences by producing scores more closely aligned with Elo rankings, and improve downstream post-training performance. Together, these findings demonstrate that CHARM offers a simple, effective, and broadly applicable approach to building more reliable and fair reward models.
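The sketch below is a minimal, hypothetical illustration of the general idea of Elo-based score calibration, not the paper's actual CHARM procedure: it fits a per-policy-model offset so that average reward scores track Chatbot Arena Elo ratings, then subtracts that offset from raw scores. All function names and the least-squares offset scheme are illustrative assumptions.

```python
# Illustrative sketch (assumed, not the paper's exact method): calibrate per-model
# reward scores against Chatbot Arena Elo ratings by removing a fitted model bias.
import numpy as np

def elo_win_prob(elo_a: float, elo_b: float) -> float:
    """Expected probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((elo_b - elo_a) / 400.0))

def fit_model_offsets(rm_scores: dict[str, np.ndarray], elo: dict[str, float]) -> dict[str, float]:
    """Fit an additive offset per policy model so that mean calibrated RM scores
    track the models' Elo ratings (linear least-squares fit of score on Elo)."""
    models = sorted(rm_scores)
    mean_scores = np.array([rm_scores[m].mean() for m in models])
    elos = np.array([elo[m] for m in models])
    # Map Elo onto the RM score scale with a linear fit; residuals are per-model bias.
    A = np.vstack([elos, np.ones_like(elos)]).T
    slope, intercept = np.linalg.lstsq(A, mean_scores, rcond=None)[0]
    target = slope * elos + intercept
    return {m: float(mean_scores[i] - target[i]) for i, m in enumerate(models)}

def calibrate(score: float, model: str, offsets: dict[str, float]) -> float:
    """Subtract the estimated model-preference bias from a raw reward score."""
    return score - offsets.get(model, 0.0)
```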