Reward modeling has emerged as a crucial component in aligning large language models with human values. Significant attention has focused on using reward models as a means for fine-tuning generative models. However, the reward models themselves -- which directly encode human value judgments by turning prompt-response pairs into scalar rewards -- remain relatively understudied. We present a novel approach to reward model interpretability through exhaustive analysis of their responses across their entire vocabulary space. By examining how different reward models score every possible single-token response to value-laden prompts, we uncover several striking findings: (i) substantial heterogeneity between models trained on similar objectives, (ii) systematic asymmetries in how models encode high- vs low-scoring tokens, (iii) significant sensitivity to prompt framing that mirrors human cognitive biases, and (iv) overvaluation of more frequent tokens. We demonstrate these effects across ten recent open-source reward models of varying parameter counts and architectures. Our results challenge assumptions about the interchangeability of reward models, as well as their suitability as proxies of complex and context-dependent human values. We find that these models can encode concerning biases toward certain identity groups, which may emerge as unintended consequences of harmlessness training -- distortions that risk propagating through the downstream large language models now deployed to millions.
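The exhaustive single-token sweep described above can be sketched as follows. This is a minimal, hypothetical illustration: `score_all_single_tokens` and the toy reward function are stand-ins invented here, not the paper's implementation. A real run would instead batch prompt-plus-token pairs through an open-source reward model (e.g. a sequence-classification head over a transformer) for every entry in its tokenizer vocabulary.

```python
# Hypothetical sketch of exhaustive single-token reward scoring.
# A real reward model maps a (prompt, response) pair to a scalar reward;
# here a toy stand-in reward function keeps the sweep itself runnable.

def score_all_single_tokens(prompt, vocab, reward_fn):
    """Score every possible single-token response to `prompt` and
    return (token, score) pairs sorted from highest to lowest reward."""
    scores = {tok: reward_fn(prompt, tok) for tok in vocab}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Toy stand-in reward: favors longer tokens (purely illustrative --
# echoing the paper's finding that token-level artifacts, such as
# frequency, can drive scores).
def toy_reward(prompt, token):
    return len(token)

vocab = ["yes", "no", "maybe"]
ranking = score_all_single_tokens("Is honesty important?", vocab, toy_reward)
# ranking[0] is the highest-rewarded single-token response.
```

Comparing such rankings across reward models for the same value-laden prompt is what exposes the heterogeneity and asymmetries the abstract reports.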