Large language models (LLMs) excel across a wide range of capabilities but also pose safety risks, such as generating harmful content and misinformation, even after safety alignment. In this paper, we explore the inner mechanisms of safety alignment from the perspective of mechanistic interpretability, focusing on identifying and analyzing safety neurons within LLMs that are responsible for safety behaviors. We propose generation-time activation contrasting to locate these neurons and dynamic activation patching to evaluate their causal effects. Experiments on multiple recent LLMs show that: (1) Safety neurons are sparse and effective. We can restore $90$% of the safety performance by intervening on only about $5$% of all neurons. (2) Safety neurons encode transferable mechanisms. They exhibit consistent effectiveness on different red-teaming datasets. The identified safety neurons also help explain the "alignment tax": we observe that the key neurons for safety and helpfulness overlap significantly, but the two objectives require different activation patterns on these shared neurons. Furthermore, we demonstrate an application of safety neurons in detecting unsafe outputs before generation. Our findings may promote further research on understanding LLM alignment. The source code will be publicly released to facilitate future research.
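To make the core idea concrete, below is a minimal sketch (not the paper's exact procedure) of the activation-contrasting step: score MLP neurons by how differently they activate in a safety-aligned model versus its unaligned counterpart on the same prompts. The module-name filter, the use of a single forward pass rather than full generation, and the simple mean-difference score are all simplifying assumptions for illustration.

```python
# Sketch of activation contrasting between an aligned and an unaligned model.
# Assumes both models are PyTorch modules whose MLP projections are nn.Linear
# layers with "mlp" in their module names (true for Llama-style models);
# adapt the filter for other architectures.
import torch


def collect_mlp_activations(model, input_ids, mlp_name_fragment="mlp"):
    """Run one forward pass and record the mean activation of each MLP neuron."""
    acts = {}
    hooks = []

    def make_hook(name):
        def hook(module, inputs, output):
            # Average over batch and sequence positions -> one scalar per neuron.
            acts[name] = output.detach().float().mean(dim=(0, 1))
        return hook

    for name, module in model.named_modules():
        if mlp_name_fragment in name and isinstance(module, torch.nn.Linear):
            hooks.append(module.register_forward_hook(make_hook(name)))

    with torch.no_grad():
        model(input_ids)

    for h in hooks:
        h.remove()
    return acts


def contrast_scores(aligned_model, base_model, input_ids):
    """Rank neurons by |aligned activation - base activation| (a simple proxy score)."""
    aligned_acts = collect_mlp_activations(aligned_model, input_ids)
    base_acts = collect_mlp_activations(base_model, input_ids)
    scores = {}
    for name in aligned_acts.keys() & base_acts.keys():
        diff = (aligned_acts[name] - base_acts[name]).abs()
        for idx, value in enumerate(diff.tolist()):
            scores[(name, idx)] = value
    # Highest-scoring (module, neuron index) pairs are candidate safety neurons.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

For the causal evaluation, a patching variant of the same hook mechanism would overwrite the candidate neurons' activations in one model with those recorded from the other during generation and measure the resulting change in safety behavior; the details above are illustrative rather than the paper's implementation.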