Feint and Attack: Attention-Based Strategies for Jailbreaking and Protecting LLMs

Jailbreak attacks can be used to assess the vulnerabilities of Large Language Models (LLMs) by inducing them to generate harmful content. The most common attack method is to construct semantically ambiguous prompts that confuse and mislead the LLM. To assess the security of LLMs and reveal the intrinsic relation between the input prompt and the output, the distribution of attention weights is analyzed. Using statistical analysis, several novel metrics are defined to describe this distribution: the Attention Intensity on Sensitive Words (Attn_SensWords), the Attention-based Contextual Dependency Score (Attn_DepScore), and the Attention Dispersion Entropy (Attn_Entropy). By leveraging the distinct characteristics of these metrics together with the beam search algorithm, and inspired by the military strategy "Feint and Attack", an effective jailbreak attack strategy named Attention-Based Attack (ABA) is proposed. ABA employs nested attack prompts to divert the attention distribution of the LLM, so that more of the harmless parts of the input attract the model's attention. Motivated by ABA, an effective defense strategy called Attention-Based Defense (ABD) is also put forward. In contrast to ABA, ABD enhances the robustness of LLMs by calibrating the attention distribution of the input prompt. Comparative experiments demonstrate the effectiveness of both ABA and ABD, so both can be used to assess the security of LLMs. The experimental results also support the explanation that the distribution of attention weights strongly influences the output of LLMs.
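The abstract names the three metrics but gives no formulas for them. As a rough illustration of what a dispersion measure and a sensitive-word intensity measure over attention weights might look like, the sketch below computes the Shannon entropy of a normalized attention vector and the share of attention mass that falls on a given word list. The function names, the choice of Shannon entropy, the mass-share formula, and the toy tokens and weights are all assumptions made for illustration; they are not the paper's definitions of Attn_Entropy or Attn_SensWords.

```python
import math


def attention_entropy(weights):
    """Shannon entropy (in nats) of a non-negative attention-weight vector.

    Assumption: dispersion is measured as the entropy of the weights after
    normalizing them to sum to 1. The paper's Attn_Entropy may differ.
    """
    total = sum(weights)
    if total <= 0:
        raise ValueError("attention weights must sum to a positive value")
    probs = [w / total for w in weights]
    # Zero-probability entries contribute nothing to the entropy.
    return -sum(p * math.log(p) for p in probs if p > 0)


def sensitive_attention_share(weights, tokens, sensitive_words):
    """Fraction of total attention mass placed on tokens in `sensitive_words`.

    Assumption: a simple mass-share stand-in for an intensity metric such as
    Attn_SensWords; the word list and the exact matching rule are hypothetical.
    """
    if len(weights) != len(tokens):
        raise ValueError("weights and tokens must have the same length")
    total = sum(weights)
    if total <= 0:
        raise ValueError("attention weights must sum to a positive value")
    hit = sum(w for w, t in zip(weights, tokens) if t.lower() in sensitive_words)
    return hit / total


if __name__ == "__main__":
    # Toy, hand-written values; real weights would come from a model's attention maps.
    tokens = ["tell", "me", "a", "story", "about", "locks"]
    weights = [0.10, 0.05, 0.05, 0.40, 0.10, 0.30]
    print("entropy:", attention_entropy(weights))
    print("share on sensitive words:",
          sensitive_attention_share(weights, tokens, {"locks"}))
```

Under these assumptions, a prompt whose attention is spread evenly across many tokens yields a higher entropy and a lower share on the listed words than a prompt whose attention concentrates on one of them, which is the kind of contrast the metrics in the abstract are meant to capture.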
