Reward models (RMs) are central to aligning large language models (LLMs) with human values, yet they have received far less attention than the pre-trained and post-trained LLMs themselves. Because RMs are initialized from LLMs, they inherit representations that shape their behavior, but the nature and extent of this influence remain understudied. In a comprehensive study of 10 leading open-weight RMs using validated psycholinguistic corpora, we show that RMs exhibit significant differences along multiple dimensions of human values as a function of their base model. Using the "Big Two" psychological axes, we show that Llama-based RMs robustly prefer "agency" while Gemma-based RMs robustly prefer "communion." This phenomenon persists even when the preference data and fine-tuning procedure are identical, and we trace it back to the logits of the respective instruction-tuned and pre-trained models. These log-probability differences can themselves be formulated as an implicit RM; we derive usable implicit reward scores and show that they exhibit the very same agency/communion difference. Training RMs with ablations over preference data source and quantity, we demonstrate that this effect is not only repeatable but surprisingly durable. Although RMs are designed to represent human preferences, our evidence shows that their outputs are shaped by the pre-trained LLMs on which they are based. This work underscores the importance of safety and alignment efforts at the pre-training stage, and makes clear that an open-source developer's choice of base model is as much a decision about values as about performance.
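For concreteness, the log-probability difference referred to above can be written in the standard DPO-style implicit-reward form sketched below; the scaling factor β and the exact choice of reference model are illustrative assumptions here rather than the paper's precise formulation:

\[
r_{\mathrm{implicit}}(x, y) \;=\; \beta \,\bigl[ \log \pi_{\mathrm{post}}(y \mid x) \;-\; \log \pi_{\mathrm{pre}}(y \mid x) \bigr],
\]

where \(\pi_{\mathrm{post}}\) denotes the instruction-tuned model, \(\pi_{\mathrm{pre}}\) the pre-trained base model it was derived from, and \(x, y\) a prompt and candidate response. A higher score indicates that post-training shifted probability mass toward \(y\), which is what allows such score differences to be read as an implicit reward.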