Reward models are a key component of large language model alignment, serving as proxies for human preferences during training. However, existing evaluations focus primarily on broad instruction-following benchmarks, providing limited insight into whether these models capture socially desirable preferences. As a result, important failures in social alignment can remain hidden. We extend reward-model benchmarking to four socially consequential domains: bias, safety, morality, and ethical reasoning. We introduce a framework that converts social evaluation datasets into pairwise preference data, leveraging gold labels where available and directional bias indicators otherwise. This enables us to test whether reward models prefer socially undesirable responses, and whether their preferences produce systematically biased distributions over selected outputs. Across five publicly available reward models and two instruction-tuned models used as reward proxies, we find substantial variation across domains, with no single model performing best overall. The models fall well short of strong social intelligence: they often prefer socially undesirable options, and their preferences produce systematically biased distributions. Moreover, stronger bias avoidance can reduce sensitivity to context, revealing a key alignment trade-off between avoiding biased outcomes and preserving contextual faithfulness. These findings show that standard reward benchmarks are insufficient for assessing social alignment and highlight the need for evaluations that directly measure the social preferences encoded in reward models.
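As a rough illustration of the conversion described in the abstract, the sketch below turns a multiple-choice social evaluation item into pairwise preference data and then computes the two quantities the abstract mentions: how often a scorer prefers the undesirable response, and how its top choices are distributed. It is not the authors' implementation. The `Item` fields, the function names, and the length-based `toy_score` are all hypothetical stand-ins; a real reward model would replace `toy_score`.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple

# (prompt, preferred response, dispreferred response)
Pair = Tuple[str, str, str]
# Hypothetical scorer interface: higher score means "more preferred".
Scorer = Callable[[str, str], float]


@dataclass
class Item:
    """One evaluation item (assumed schema, not taken from the paper)."""
    prompt: str
    options: List[str]
    gold: Optional[int] = None    # index of the correct option, if labeled
    biased: Optional[int] = None  # index of the bias-aligned option, if known


def to_pairs(item: Item) -> List[Pair]:
    """Use the gold label when available, else the directional bias indicator."""
    if item.gold is not None:
        good = item.options[item.gold]
        return [(item.prompt, good, o)
                for i, o in enumerate(item.options) if i != item.gold]
    if item.biased is not None:
        bad = item.options[item.biased]
        return [(item.prompt, o, bad)
                for i, o in enumerate(item.options) if i != item.biased]
    return []  # no supervision signal: the item yields no pairs


def undesirable_rate(pairs: List[Pair], score: Scorer) -> float:
    """Fraction of pairs where the dispreferred response scores strictly higher."""
    if not pairs:
        return 0.0
    wrong = sum(score(p, bad) > score(p, good) for p, good, bad in pairs)
    return wrong / len(pairs)


def selection_distribution(items: List[Item], score: Scorer) -> Dict[str, float]:
    """Share of bias-annotated items whose top-scored option is the biased one."""
    counts: Counter = Counter()
    for item in items:
        if item.biased is None:
            continue
        top = max(range(len(item.options)),
                  key=lambda i: score(item.prompt, item.options[i]))
        counts["biased" if top == item.biased else "other"] += 1
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()} if total else {}


def toy_score(prompt: str, response: str) -> float:
    """Placeholder scorer that simply favors longer responses."""
    return float(len(response))


if __name__ == "__main__":
    demo = [
        Item("q1", ["cannot be determined", "the man"], gold=0),
        Item("q2", ["the woman did it", "no"], biased=0),
    ]
    demo_pairs = [pair for item in demo for pair in to_pairs(item)]
    print(undesirable_rate(demo_pairs, toy_score))
    print(selection_distribution(demo, toy_score))
```

Keeping the scorer behind a plain `(prompt, response) -> float` callable means a dedicated reward model and an instruction-tuned model used as a reward proxy could be compared through the same two functions, under the assumption that both can be wrapped to return a scalar score.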