Adversarial prompts generated with gradient-based methods achieve outstanding performance in automatic jailbreak attacks against safety-aligned LLMs. Nevertheless, owing to the discrete nature of text, the input gradient of an LLM struggles to precisely reflect the magnitude of the loss change caused by token replacements in the prompt, leading to limited attack success rates against safety-aligned LLMs, even in the white-box setting. In this paper, we explore a new perspective on this problem, suggesting that it can be alleviated by leveraging innovations originally proposed in transfer-based attacks against black-box image classification models. For the first time, we adapt the core ideas of effective transfer-based attacks, namely the Skip Gradient Method and Intermediate Level Attack, to gradient-based adversarial prompt generation and achieve significant performance gains without incurring obvious computational cost. Meanwhile, by discussing the mechanisms behind these gains, we draw new insights and develop appropriate combinations of these methods. Our empirical results show that 87% of the query-specific adversarial suffixes generated by the developed combination can induce Llama-2-7B-Chat to produce output that exactly matches the target string on AdvBench. This match rate is 33% higher than that of a very strong baseline known as GCG, demonstrating advanced discrete optimization for adversarial prompt generation against LLMs. In addition, without incurring obvious cost, the combination achieves a >30% absolute increase in attack success rate over GCG when generating both query-specific (38% -> 68%) and universal adversarial prompts (26.68% -> 60.32%) against the Llama-2-7B-Chat model on AdvBench. Code at: https://github.com/qizhangli/Gradient-based-Jailbreak-Attacks.
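To make the gradient-based discrete optimization concrete, the following is a minimal NumPy sketch of the first-order token-swap scoring that GCG-style attacks rely on. Everything here is an illustrative assumption, not the paper's actual setup: the embedding table `E`, the scalar "model" `W`, the toy squared-error loss, and the suffix length are all stand-ins. The key idea shown is that the gradient with respect to each position's (one-hot) token embedding gives a linearized estimate of the loss change for every candidate replacement, and the top-scoring candidates are then re-evaluated with the true loss.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 50, 8                         # toy vocabulary size and embedding dim
E = rng.normal(size=(V, d))          # toy embedding table
W = rng.normal(size=(d,))            # toy "model": a single linear readout
prompt = rng.integers(0, V, size=6)  # current adversarial suffix (token ids)

def loss(tokens):
    # toy surrogate loss: drive the mean-pooled embedding's logit toward a target value
    return (E[tokens].mean(axis=0) @ W - 3.0) ** 2

def swap_scores(tokens):
    # analytic gradient of the toy loss w.r.t. each position's embedding:
    # dL/de_i = 2 (m @ W - 3) * W / n, where m is the mean-pooled embedding
    m = E[tokens].mean(axis=0)
    g_e = 2.0 * (m @ W - 3.0) * W / len(tokens)
    # linearized loss change of replacing position i's token with vocab token v:
    # score[i, v] ~= g_e . (E[v] - E[tokens[i]])  (more negative = more promising)
    return (E @ g_e)[None, :] - (E[tokens] @ g_e)[:, None]

# one greedy step: shortlist the k swaps with the lowest linear estimate,
# then keep the candidate with the lowest *true* loss
scores = swap_scores(prompt)
shortlist = np.argsort(scores.ravel())[:8]
best = prompt.copy()
for idx in shortlist:
    i, v = divmod(int(idx), V)
    cand = prompt.copy()
    cand[i] = v
    if loss(cand) < loss(best):
        best = cand
print(loss(prompt), loss(best))
```

Because the discrete swap is only approximated to first order, the linear score can mis-rank candidates; re-checking the shortlist with the exact loss, as above, is what keeps the greedy step from degrading. This gap between the linear estimate and the true loss change is precisely the weakness the abstract describes.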