Reward models are a key component of reinforcement learning from human feedback (RLHF), aligning language models toward both helpful and harmless behaviour. However, the internal mechanisms underlying these objectives, and the conflicts between them, remain poorly understood. We study alignment tension in reward models trained under helpfulness-only, harmlessness-only, and mixed-objective settings. Mixed-objective models often underperform single-objective models, indicating interference between the objectives. Using activation-based methods, we identify neurons associated with each objective and study their functional roles via targeted ablations. These neurons causally support their corresponding objective while often negatively affecting the opposing one. Moreover, a substantial proportion of neurons are shared between helpfulness and harmlessness, and these shared neurons exert a disproportionate influence on model behaviour, contributing to alignment tension. Together, our results offer a mechanistic account of how alignment objectives are represented in reward models and why multi-objective alignment remains challenging, motivating future work on disentangled and controllable alignment methods.
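As a rough illustration of the identify-then-ablate procedure summarised above, the following sketch works on synthetic activations. Everything in it is an assumption made for illustration and is not taken from the paper: the mean-activation-gap selection rule, zero-ablation, the toy linear reward head `w`, the planted signal neurons, and the helper names `top_k_neurons`, `pairwise_accuracy`, and `synthetic_pairs`.

```python
import numpy as np

def top_k_neurons(chosen, rejected, k):
    """Indices of the k neurons with the largest absolute mean-activation
    gap between preferred and rejected responses (assumed selection rule)."""
    gap = chosen.mean(axis=0) - rejected.mean(axis=0)
    return np.argsort(-np.abs(gap))[:k]

def pairwise_accuracy(chosen, rejected, w, ablate=None):
    """Fraction of pairs where a toy linear reward head scores the preferred
    response higher; neurons listed in `ablate` are zeroed first."""
    mask = np.ones(w.shape[0])
    if ablate is not None:
        mask[list(ablate)] = 0.0
    return float(np.mean((chosen * mask) @ w > (rejected * mask) @ w))

def synthetic_pairs(rng, n, d, signal):
    """Gaussian activations with a planted shift on the `signal` neurons
    of the preferred responses (hypothetical data, not real activations)."""
    chosen = rng.normal(size=(n, d))
    rejected = rng.normal(size=(n, d))
    chosen[:, signal] += 2.0
    return chosen, rejected

rng = np.random.default_rng(0)
n, d, k = 200, 32, 4

# Helpfulness signal planted on neurons 0-3, harmlessness on neurons 2-5,
# so neurons 2 and 3 are shared between the two objectives by construction.
help_c, help_r = synthetic_pairs(rng, n, d, slice(0, 4))
harm_c, harm_r = synthetic_pairs(rng, n, d, slice(2, 6))

# Toy reward head that weights the six planted neurons most heavily.
w = np.full(d, 0.1)
w[:6] = 1.0

help_neurons = top_k_neurons(help_c, help_r, k)
harm_neurons = top_k_neurons(harm_c, harm_r, k)
shared = sorted(set(help_neurons.tolist()) & set(harm_neurons.tolist()))

# Ablating the helpfulness neurons should hurt helpfulness accuracy.
acc_before = pairwise_accuracy(help_c, help_r, w)
acc_after = pairwise_accuracy(help_c, help_r, w, ablate=help_neurons)

# Cross-objective effect: the same ablation evaluated on harmlessness pairs.
harm_before = pairwise_accuracy(harm_c, harm_r, w)
harm_after = pairwise_accuracy(harm_c, harm_r, w, ablate=help_neurons)

print("shared neurons:", shared)
print("helpfulness accuracy:", acc_before, "->", acc_after)
print("harmlessness accuracy:", harm_before, "->", harm_after)
```

In this toy setting the overlap exists only because the signal was planted on overlapping neurons; it shows the kind of measurement involved (per-objective neuron sets, their intersection, and accuracy before and after ablation), not the paper's findings.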