Large language models (LLMs) aligned to human preferences via reinforcement learning from human feedback (RLHF) underpin many commercial applications. However, how RLHF affects the internals of an LLM remains opaque. We propose a novel method for interpreting the reward functions learned by RLHF-tuned LLMs using sparse autoencoders. Our approach trains sets of autoencoders on activations from a base LLM and from its RLHF-tuned version. By comparing the autoencoders' hidden spaces, we identify unique features that reflect the accuracy of the learned reward model. To quantify this, we construct a scenario in which the tuned LLM learns token-reward mappings to maximize reward. This is the first application of sparse autoencoders to interpreting learned rewards and, more broadly, to inspecting reward learning in LLMs. Our method provides an abstract approximation of reward integrity, and presents a promising technique for ensuring alignment between specified objectives and model behaviors.
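The pipeline described above (train sparse autoencoders on activations from both models, then compare their hidden spaces) can be illustrated with a minimal sketch. The abstract does not specify the autoencoder architecture, the loss, or the comparison metric, so everything here is an assumption: a one-layer ReLU autoencoder with an L1 sparsity penalty, plain gradient descent, and maximum cosine similarity between decoder directions as the comparison. The names `train_sparse_autoencoder` and `max_cosine_similarity` are hypothetical, and the activations are synthetic placeholders rather than real LLM activations.

```python
import numpy as np


def train_sparse_autoencoder(acts, n_hidden, l1=1e-3, lr=0.01, steps=300, seed=0):
    """Fit a one-layer ReLU autoencoder with an L1 penalty (assumed setup).

    acts: array of shape (n_samples, d_model).
    Returns the decoder matrix (n_hidden, d_model) and the loss per step.
    """
    rng = np.random.default_rng(seed)
    n, d = acts.shape
    w_enc = 0.1 * rng.standard_normal((d, n_hidden))
    b_enc = np.zeros(n_hidden)
    w_dec = 0.1 * rng.standard_normal((n_hidden, d))
    b_dec = np.zeros(d)
    losses = []
    for _ in range(steps):
        # Forward pass: sparse code z, then linear reconstruction.
        z = np.maximum(acts @ w_enc + b_enc, 0.0)
        err = z @ w_dec + b_dec - acts
        # Mean squared reconstruction error plus mean L1 penalty on z.
        losses.append(float((err ** 2).sum() / n + l1 * z.sum() / n))
        # Backward pass, computed before any parameter is updated.
        g_out = 2.0 * err / n
        g_pre = (g_out @ w_dec.T + l1 / n) * (z > 0)
        w_dec -= lr * (z.T @ g_out)
        b_dec -= lr * g_out.sum(axis=0)
        w_enc -= lr * (acts.T @ g_pre)
        b_enc -= lr * g_pre.sum(axis=0)
    return w_dec, losses


def max_cosine_similarity(dict_a, dict_b, eps=1e-12):
    """For each feature (row) in dict_a, the best cosine match in dict_b."""
    a = dict_a / (np.linalg.norm(dict_a, axis=1, keepdims=True) + eps)
    b = dict_b / (np.linalg.norm(dict_b, axis=1, keepdims=True) + eps)
    return (a @ b.T).max(axis=1)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Synthetic stand-ins for base-model and tuned-model activations.
    base_acts = rng.standard_normal((256, 8))
    tuned_acts = base_acts + 0.5 * rng.standard_normal((256, 8))
    base_dict, _ = train_sparse_autoencoder(base_acts, n_hidden=16, seed=1)
    tuned_dict, _ = train_sparse_autoencoder(tuned_acts, n_hidden=16, seed=2)
    sims = max_cosine_similarity(tuned_dict, base_dict)
    # Tuned-model features with a low best match are candidates for
    # features that exist only after tuning (under this assumed metric).
    print("least-matched tuned features:", np.argsort(sims)[:3])
```

Under these assumptions, a tuned-model feature whose best match in the base-model dictionary is weak is a candidate "unique" feature. Whether such a feature actually relates to the learned reward would still need to be checked separately.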