Open-source large language models (OSLLMs) have demonstrated remarkable generative performance. However, because their architectures and weights are public, they remain exposed to jailbreak attacks even after alignment. Existing attacks operate primarily at shallow levels, such as the prompt or embedding level, and often fail to expose vulnerabilities rooted in deeper model components, creating a false sense of security about the effectiveness of current defenses. In this paper, we propose the \textbf{\underline{S}}afety \textbf{\underline{A}}ttention \textbf{\underline{H}}ead \textbf{\underline{A}}ttack (\textbf{SAHA}), an attention-head-level jailbreak framework that exploits vulnerabilities in deeper, insufficiently aligned attention heads. SAHA contains two novel designs. First, we reveal that deeper attention layers are more vulnerable to jailbreak attacks. Based on this finding, SAHA introduces an \textit{Ablation-Impact Ranking} head-selection strategy to effectively locate the layers most critical to unsafe output. Second, we introduce a boundary-aware perturbation method, \textit{Layer-Wise Perturbation}, which probes the generation of unsafe content while perturbing the attention minimally. This constrained perturbation preserves higher semantic relevance to the target intent while ensuring evasion. Extensive experiments demonstrate the superiority of our method: SAHA improves the attack success rate (ASR) by 14\% over state-of-the-art baselines, revealing the attention head as a vulnerable attack surface. Our code is available at https://anonymous.4open.science/r/SAHA.