The Neutral Mask: How RLHF Provides Shallow Alignment while Leaving Partisan Structure Intact in a Large Language Model

The ambition behind alignment training is to make large language models safe and useful. The primary mechanism, reinforcement learning from human feedback (RLHF), shapes the behavior of deployed language models by aligning them with "human values." Yet the process is opaque. What values are being encoded, whose values are they, and how does RLHF encode them? A growing body of evidence suggests that RLHF produces only functional compliance rather than deep alignment. We offer a mechanistic case study of this phenomenon for partisan political orientation by comparing the internal representations of Llama 3.1 8B before and after RLHF. We show that RLHF does not remove the structured partisan direction in the base model. Instead, it compresses the variance of the partisan signal to generate consistently balanced and non-partisan output. Sparse autoencoder decomposition reveals that policy-encoding features, which activate sporadically in the base model, are completely inactive in the Instruct model. Feature-level steering experiments confirm the causal disconnect. RLHF thus encodes a norm of political neutrality, not by erasing the model's knowledge of partisanship, but by severing the causal pathway from partisan geometry to output generation. Importantly, this neutrality is functional, not structural: the underlying geometry that enables partisan steering remains intact. Mechanisms that bypass RLHF's guardrails, such as inferring and amplifying a user's partisan identity, reactivate partisan generation. If RLHF operates by disconnecting rather than removing value-laden structure, then the same pattern may hold for other value domains, and the aligned model's behavior may be more fragile than its outputs suggest.
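To make the distinction between an intact direction and a compressed signal concrete, the following is a minimal toy sketch, not the authors' pipeline. It assumes the partisan direction is estimated as a difference of class means (the abstract does not say how the direction is obtained), and it uses purely synthetic vectors in place of real Llama 3.1 8B activations. All function names and parameters (`partisan_direction`, `synthetic_activations`, `spread`, `axis`) are hypothetical.

```python
import math
import random
import statistics


def mean_vector(rows):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def unit(v):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def partisan_direction(left, right):
    """Unit difference-of-means direction (an assumed estimator)."""
    mu_l, mu_r = mean_vector(left), mean_vector(right)
    return unit([r - l for l, r in zip(mu_l, mu_r)])


def projections(rows, direction):
    """Scalar projection of each vector onto the given direction."""
    return [sum(x * d for x, d in zip(row, direction)) for row in rows]


def cosine(u, v):
    """Cosine similarity between two vectors."""
    return sum(a * b for a, b in zip(unit(u), unit(v)))


def synthetic_activations(rng, n, dim, axis, spread, noise=0.1):
    """Toy stand-in for activations: class label times `spread` along one
    coordinate, plus Gaussian noise. This is NOT real model data."""
    left, right = [], []
    for _ in range(n):
        for label, bucket in ((-1.0, left), (1.0, right)):
            row = [rng.gauss(0.0, noise) for _ in range(dim)]
            row[axis] += label * spread
            bucket.append(row)
    return left, right


rng = random.Random(0)
# Same separating axis in both "models"; only the spread along it differs.
base_left, base_right = synthetic_activations(rng, 200, 16, axis=3, spread=2.0)
inst_left, inst_right = synthetic_activations(rng, 200, 16, axis=3, spread=0.5)

d_base = partisan_direction(base_left, base_right)
d_inst = partisan_direction(inst_left, inst_right)

# Direction preserved: cosine similarity between the two estimates.
direction_similarity = cosine(d_base, d_inst)

# Signal compressed: variance of projections onto the base direction.
var_base = statistics.pvariance(projections(base_left + base_right, d_base))
var_inst = statistics.pvariance(projections(inst_left + inst_right, d_base))
variance_ratio = var_inst / var_base
```

In this toy setup, `direction_similarity` stays close to 1 while `variance_ratio` falls well below 1, which is the qualitative pattern the abstract describes: the direction survives while the spread along it shrinks. The numbers come from the synthetic generator and say nothing about the actual magnitudes in the paper.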
