While Large Language Models (LLMs) are aligned to mitigate risks, their safety guardrails remain fragile against jailbreak attacks. This fragility reflects a limited understanding of which model components govern safety. Existing attribution methods are local and greedy, assuming that components contribute independently; they therefore overlook cooperative interactions among components such as attention heads, which jointly implement safety mechanisms. We propose \textbf{G}lobal \textbf{O}ptimization for \textbf{S}afety \textbf{V}ector Extraction (GOSV), a framework that identifies safety-critical attention heads by optimizing over all heads simultaneously. We employ two complementary activation-repatching strategies: Harmful Patching and Zero Ablation. These strategies identify two spatially distinct sets of safety vectors with consistently low overlap, which we term Malicious Injection Vectors and Safety Suppression Vectors, demonstrating that aligned LLMs maintain separate functional pathways for safety. Through systematic analysis, we find that across all models, complete safety breakdown occurs once approximately 30\% of attention heads are repatched. Building on these insights, we develop a novel inference-time white-box jailbreak method that exploits the identified safety vectors through activation repatching. Our attack substantially outperforms existing white-box attacks on all test models, providing strong evidence for the effectiveness of the proposed GOSV framework in interpreting LLM safety.
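To make the two repatching strategies concrete, the following is a minimal, hypothetical sketch (not the authors' implementation) of how Harmful Patching and Zero Ablation could be applied to selected attention heads via a PyTorch forward pre-hook. The hook targets the input of the attention output projection, where per-head outputs are still concatenated and thus individually addressable; the module path, \texttt{head\_dim}, the chosen head indices, and the \texttt{harmful\_acts} cache are all assumptions for illustration.

\begin{verbatim}
import torch

def make_repatch_pre_hook(head_ids, head_dim, harmful_acts=None):
    """Build a forward pre-hook for an attention output projection (o_proj).

    At o_proj's input the per-head outputs are still concatenated, so each
    head occupies a contiguous slice of the hidden dimension.
    harmful_acts=None  -> Zero Ablation: selected heads are zeroed out.
    harmful_acts given -> Harmful Patching: selected heads are overwritten
                          with activations cached from a harmful prompt
                          (assumes matching sequence positions).
    """
    def pre_hook(module, args):
        x = args[0]                      # (batch, seq, n_heads * head_dim)
        b, s, d = x.shape
        per_head = x.view(b, s, d // head_dim, head_dim).clone()
        for h in head_ids:
            if harmful_acts is None:
                per_head[:, :, h, :] = 0.0                       # Zero Ablation
            else:
                per_head[:, :, h, :] = harmful_acts[:, :, h, :]  # Harmful Patching
        # Returning a tuple from a pre-hook replaces the module's inputs.
        return (per_head.view(b, s, d),) + args[1:]
    return pre_hook

# Usage sketch (hypothetical module path for a Llama-style model):
# layer = model.model.layers[10].self_attn.o_proj
# handle = layer.register_forward_pre_hook(
#     make_repatch_pre_hook(head_ids={3, 7}, head_dim=128))
# ... run generation under the patch, then:
# handle.remove()
\end{verbatim}

Patching before the output projection, rather than after it, is a deliberate choice in this sketch: once \texttt{o\_proj} mixes the heads, individual head contributions are no longer separable in the residual stream.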