Misaligned by Reward: Socially Undesirable Preferences in LLMs

Reward models are a key component of large language model alignment, serving as proxies for human preferences during training. However, existing evaluations focus primarily on broad instruction-following benchmarks, providing limited insight into whether these models capture socially desirable preferences. As a result, important failures in social alignment can remain hidden. We extend reward-model benchmarking to four socially consequential domains: bias, safety, morality, and ethical reasoning. We introduce a framework that converts social evaluation datasets into pairwise preference data, leveraging gold labels where available and directional bias indicators otherwise. This enables us to test whether reward models prefer socially undesirable responses, and whether their preferences produce systematically biased distributions over selected outputs. Across five publicly available reward models and two instruction-tuned models used as reward proxies, we find substantial variation across domains, with no single model performing best overall. The models fall well short of strong social intelligence: they often prefer socially undesirable options, and their preferences produce systematically biased distributions. Moreover, stronger bias avoidance can reduce sensitivity to context, revealing a key alignment trade-off between avoiding biased outcomes and preserving contextual faithfulness. These findings show that standard reward benchmarks are insufficient for assessing social alignment and highlight the need for evaluations that directly measure the social preferences encoded in reward models.

翻译：奖励模型是大语言模型对齐中的关键组成部分，在训练过程中充当人类偏好的代理。然而，现有评估主要聚焦于通用指令遵循基准测试，未能深入揭示这些模型是否捕捉到社会期望的偏好。这导致社会对齐中的重要失败可能被隐藏。我们将奖励模型基准测试拓展至四个具有社会影响的领域：偏见、安全性、道德规范和伦理推理。我们提出一个框架，将社会评估数据集转化为成对偏好数据，在可用时利用黄金标签，否则利用方向性偏差指标。这使我们能够检验奖励模型是否偏好社会不良回应，以及其偏好是否在选定输出中产生系统性偏差分布。通过对五个公开奖励模型和两个作为奖励代理的指令调优模型进行评估，我们发现不同领域间存在显著差异，且没有单一模型在整体上表现最佳。这些模型远未达到强社会智能水平：它们常偏好社会不良选项，其偏好产生系统性偏差分布。此外，更强的偏见规避能力可能降低对上下文的敏感性，这揭示了避免偏见结果与保持上下文忠实性之间的关键对齐权衡。这些发现表明，标准奖励基准不足以评估社会对齐，并凸显了直接衡量奖励模型编码的社会偏好的评估需求。

相关内容

MoDELS

关注 45

ACM/IEEE第23届模型驱动工程语言和系统国际会议，是模型驱动软件和系统工程的首要会议系列，由ACM-SIGSOFT和IEEE-TCSE支持组织。自1998年以来，模型涵盖了建模的各个方面，从语言和方法到工具和应用程序。模特的参加者来自不同的背景，包括研究人员、学者、工程师和工业专业人士。MODELS 2019是一个论坛，参与者可以围绕建模和模型驱动的软件和系统交流前沿研究成果和创新实践经验。今年的版本将为建模社区提供进一步推进建模基础的机会，并在网络物理系统、嵌入式系统、社会技术系统、云计算、大数据、机器学习、安全、开源等新兴领域提出建模的创新应用以及可持续性。官网链接：http://www.modelsconference.org/

大型语言模型中隐性与显性偏见的综合研究

专知会员服务

17+阅读 · 2025年11月25日

【EMNLP2025】面向大语言模型的权重旋转偏好优化

专知会员服务

12+阅读 · 2025年8月27日

【ICML2025】大语言模型的有限理性：推理时的“满意化”对齐策略

专知会员服务

11+阅读 · 2025年6月1日

【AAAI2025】偏好导向的监督微调：优先选择目标模型而非对齐的大语言模型

专知会员服务

23+阅读 · 2024年12月18日