Large language models (LLMs) achieve state-of-the-art performance on multiple language tasks, yet their safety guardrails can be circumvented, leading to harmful generations. In light of this, recent research on safety mechanisms has emerged, revealing that when safety representations or components are suppressed, the safety capabilities of LLMs are compromised. However, existing research tends to overlook the safety impact of multi-head attention mechanisms, despite their crucial role in various model functionalities. Hence, in this paper, we explore the connection between standard attention mechanisms and safety capability to fill this gap in safety-related mechanistic interpretability. We propose a novel metric tailored to multi-head attention, the Safety Head ImPortant Score (Ships), to assess each head's contribution to model safety. Based on this, we generalize Ships to the dataset level and further introduce the Safety Attention Head AttRibution Algorithm (Sahara) to attribute the critical safety attention heads inside the model. Our findings show that specific attention heads have a significant impact on safety. Ablating a single safety head allows an aligned model (e.g., Llama-2-7b-chat) to respond to 16 times more harmful queries while modifying only 0.006% of the parameters, in contrast to the ~5% modification required in previous studies. More importantly, through comprehensive experiments we demonstrate that attention heads primarily function as feature extractors for safety and that models fine-tuned from the same base model exhibit overlapping safety heads. Together, our attribution approach and findings provide a novel perspective for unpacking the black box of safety mechanisms within large models.
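To make the single-head ablation concrete, the sketch below shows one plausible way to silence an attention head in a HuggingFace Llama-2 checkpoint: zeroing the output-projection columns that carry that head's contribution. This is a minimal illustration, not necessarily the paper's exact procedure; the `LAYER` and `HEAD` coordinates are hypothetical placeholders rather than the safety head identified by Sahara.

```python
# Minimal sketch of ablating one attention head by zeroing its slice of the
# output projection (o_proj). Assumes a Llama-style architecture loaded via
# HuggingFace transformers; LAYER/HEAD are hypothetical, not the paper's values.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-chat-hf")

LAYER, HEAD = 16, 11  # hypothetical safety-head coordinates

attn = model.model.layers[LAYER].self_attn
head_dim = model.config.hidden_size // model.config.num_attention_heads

with torch.no_grad():
    # o_proj consumes the concatenated per-head outputs, so zeroing the
    # columns belonging to HEAD removes that head's contribution to the
    # residual stream while leaving every other parameter untouched.
    attn.o_proj.weight[:, HEAD * head_dim:(HEAD + 1) * head_dim] = 0.0

# The touched slice is hidden_size * head_dim weights out of ~7B, i.e. on
# the order of the 0.006% figure quoted in the abstract.
num_zeroed = model.config.hidden_size * head_dim
print(f"modified {num_zeroed / model.num_parameters():.6%} of parameters")
```

Zeroing the `o_proj` slice (rather than the head's query/key/value weights) is one common ablation choice, since it guarantees the head writes nothing into the residual stream regardless of its attention pattern.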