Reinforcement learning from human feedback (RLHF) is a key paradigm for aligning large language models (LLMs) with human values, yet the reward models at its core remain largely opaque. In this work, we present Sparse Autoencoder For Enhanced Reward model (\textbf{SAFER}), a novel framework for interpreting and improving reward models through mechanistic analysis. Leveraging Sparse Autoencoders (SAEs), we uncover human-interpretable features in reward model activations, enabling insight into safety-relevant decision-making. We apply SAFER to safety-oriented preference datasets and quantify the salience of individual features via activation differences between chosen and rejected responses. Using these feature-level signals, we design targeted data poisoning and denoising strategies. Experiments show that SAFER can precisely degrade or enhance safety alignment with minimal data modification, without sacrificing general chat performance. Our approach contributes to interpreting, auditing, and refining reward models in high-stakes LLM alignment tasks. Our code is available at https://github.com/xzy-101/SAFER-code. \textit{This paper discusses topics related to reward model safety and may include discussions or examples that highlight potential risks or unsafe outcomes.}
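As a concrete illustration of the feature-level salience signal mentioned above (the exact formulation appears in the paper body, not in this abstract; the notation below is a sketch of one natural choice rather than the paper's definition), a SAE feature $i$ can be scored by its mean activation gap between chosen and rejected responses:
\begin{equation*}
s_i \;=\; \mathbb{E}_{(x,\,y^{+},\,y^{-})\sim\mathcal{D}}\!\left[\, f_i\!\big(h(x, y^{+})\big) \;-\; f_i\!\big(h(x, y^{-})\big) \,\right],
\end{equation*}
where $h(\cdot)$ denotes the reward model's activations at the probed layer, $f_i(\cdot)$ is the activation of the $i$-th SAE feature, and $\mathcal{D}$ is the safety-oriented preference dataset. Under this reading, features with large $|s_i|$ are the ones targeted by the data poisoning and denoising strategies.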