Large Language Models (LLMs) have achieved remarkable success across diverse tasks, yet they remain vulnerable to adversarial attacks, notably the well-documented \textit{jailbreak} attack. Recently, the Greedy Coordinate Gradient (GCG) attack has been shown to exploit this vulnerability effectively by optimizing adversarial prompts through a combination of gradient heuristics and greedy search. However, the low efficiency of this attack has become a bottleneck in the attack process. To mitigate this limitation, we rethink the generation of adversarial prompts through an optimization lens, aiming to stabilize the optimization process and to draw on more heuristic information from previous iterations. Specifically, we introduce the \textbf{M}omentum \textbf{A}ccelerated G\textbf{C}G (\textbf{MAC}) attack, which incorporates a momentum term into the gradient heuristic. Experimental results show that MAC notably improves gradient-based attacks on aligned language models. Our code is available at https://github.com/weizeming/momentum-attack-llm.
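To make the idea of a momentum term in the gradient heuristic concrete, here is a minimal sketch for a single suffix position over a toy vocabulary. It is an illustration under stated assumptions, not the authors' implementation (see the repository above for that): the exponential-moving-average form of the update, the parameter name `mu`, and the helper names `momentum_update` and `top_k_candidates` are all hypothetical and not taken from the abstract. The candidate selection follows the usual GCG convention of preferring tokens with the most negative gradient.

```python
from typing import List


def momentum_update(prev: List[float], grad: List[float], mu: float) -> List[float]:
    """Blend the previous momentum with the current gradient.

    Assumed form (not stated in the abstract): m_t = mu * m_{t-1} + (1 - mu) * g_t.
    """
    return [mu * m + (1.0 - mu) * g for m, g in zip(prev, grad)]


def top_k_candidates(smoothed_grad: List[float], k: int) -> List[int]:
    """Return the k token ids with the most negative smoothed gradient.

    In GCG-style search these ids are the candidate replacements that are
    then evaluated greedily against the actual loss.
    """
    order = sorted(range(len(smoothed_grad)), key=lambda i: smoothed_grad[i])
    return order[:k]


def run_demo() -> List[int]:
    # Toy per-token gradients for one suffix position over three iterations.
    # The second iteration is deliberately noisy for token 1.
    grads = [
        [0.5, -1.0, 0.2, 0.0],
        [0.4, 1.0, 0.1, -0.2],
        [0.3, -0.8, 0.3, 0.1],
    ]
    momentum = [0.0] * 4  # start with no history
    for g in grads:
        momentum = momentum_update(momentum, g, mu=0.5)
    return top_k_candidates(momentum, k=2)


if __name__ == "__main__":
    print(run_demo())
```

With `mu = 0` the update reduces to the plain per-iteration gradient used in GCG. Larger values of `mu` give more weight to earlier iterations, which is the sense in which a momentum term can smooth out a single noisy gradient.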