Reward Models (RMs) are crucial for aligning language models with human preferences. Currently, the evaluation of RMs depends on measuring accuracy against a validation set of manually annotated preference data. Although this method is straightforward and widely adopted, the relationship between RM accuracy and downstream policy performance remains under-explored. In this work, we conduct experiments in a synthetic setting to investigate how differences in RM, as measured by accuracy, translate into gaps in optimized policy performance. Our findings reveal that while there is a weak positive correlation between accuracy and downstream performance, policies optimized against RMs with similar accuracy can exhibit quite different performance. Moreover, we find that the way accuracy is measured significantly impacts its ability to predict final policy performance. Through the lens of the Regressional Goodhart effect, we show that accuracy, when used to measure RM quality, can fail to capture potential RM overoptimization. This underscores the inadequacy of relying solely on accuracy to reflect an RM's impact on policy optimization.
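The evaluation protocol the abstract refers to can be sketched concretely. Below is a minimal illustration (not from the paper) of the standard pairwise-accuracy metric: given scalar RM scores for the human-preferred ("chosen") and dispreferred ("rejected") responses in each annotated pair, accuracy is the fraction of pairs where the RM ranks the chosen response higher. The scores here are hypothetical placeholders.

```python
import numpy as np

def pairwise_accuracy(scores_chosen, scores_rejected):
    """Fraction of preference pairs where the RM scores the
    human-preferred response strictly higher than the rejected one."""
    chosen = np.asarray(scores_chosen, dtype=float)
    rejected = np.asarray(scores_rejected, dtype=float)
    return float(np.mean(chosen > rejected))

# Hypothetical RM scores on four annotated preference pairs:
# the RM ranks the chosen response higher in 3 of 4 pairs.
acc = pairwise_accuracy([2.1, 0.5, 1.3, 3.0], [1.8, 0.9, 0.2, 2.4])
print(acc)  # 0.75
```

As the abstract notes, two RMs can score identically under this metric yet induce very different policies when optimized against, since accuracy only checks the sign of score differences, not their magnitude or calibration.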