Jailbreak attacks against large language models (LLMs) aim to induce harmful behaviors through carefully crafted adversarial prompts. One way to mitigate such attacks is adversarial training (AT)-based alignment, i.e., training LLMs on some of the most adversarial prompts so that they learn to behave safely under attack. During AT, the length of the adversarial prompts plays a critical role in the robustness of the aligned LLMs. While long adversarial prompts during AT may yield strong robustness, synthesizing them is highly resource-intensive, which can limit the applicability of LLM AT. This paper focuses on adversarial suffix jailbreak attacks and shows that, to defend against a jailbreak attack with an adversarial suffix of length $\Theta(M)$, it suffices to align LLMs on prompts with adversarial suffixes of length $\Theta(\sqrt{M})$. Theoretically, we analyze the adversarial in-context learning of linear transformers on linear regression tasks and prove a robust generalization bound for trained transformers. The bound depends on the term $\Theta(\sqrt{M_{\text{test}}}/M_{\text{train}})$, where $M_{\text{train}}$ and $M_{\text{test}}$ are the numbers of adversarially perturbed in-context samples during training and testing, respectively. Empirically, we conduct AT on popular open-source LLMs and evaluate their robustness against jailbreak attacks with adversarial suffixes of different lengths. The results confirm a positive correlation between the attack success rate and the ratio of the square root of the adversarial suffix length at jailbreak time to the suffix length used during AT. Our findings show that it is practical to defend against "long-length" jailbreak attacks via efficient "short-length" AT. The code is available at https://github.com/fshp971/adv-icl.
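The square-root scaling claimed above can be illustrated numerically. The sketch below (hypothetical helper names, constants omitted) evaluates the dominant bound term $\Theta(\sqrt{M_{\text{test}}}/M_{\text{train}})$ and shows that choosing a training suffix length of roughly $\sqrt{M}$ keeps this term bounded as the test-time suffix length $M$ grows:

```python
import math

def robust_bound_term(m_train: int, m_test: int) -> float:
    """Dominant term sqrt(M_test) / M_train of the robust generalization
    bound (absolute constants omitted; for illustration only)."""
    return math.sqrt(m_test) / m_train

# Training on adversarial suffixes of length ~sqrt(M) keeps the bound
# term constant even as the test-time suffix length M grows 64x.
for m_test in (16, 64, 256, 1024):
    m_train = int(math.sqrt(m_test))  # "short-length" AT budget
    print(f"M_test={m_test:4d}  M_train={m_train:2d}  "
          f"bound term={robust_bound_term(m_train, m_test):.2f}")
```

By contrast, fixing `m_train` while `m_test` grows drives the term up like $\sqrt{M_{\text{test}}}$, matching the reported positive correlation between attack success rate and this ratio.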