Large language models (LLMs) are developing rapidly, and safety alignment is a key requirement for their widespread deployment. Many red-teaming efforts aim to jailbreak LLMs; among them, the success of the Greedy Coordinate Gradient (GCG) attack has spurred growing interest in optimization-based jailbreaking techniques. Although GCG is a significant milestone, its attack efficiency remains unsatisfactory. In this paper, we present several improved (empirical) techniques for optimization-based jailbreaks like GCG. We first observe that the single target template of "Sure" largely limits the attack performance of GCG; to address this, we propose applying diverse target templates containing harmful self-suggestion and/or guidance to mislead LLMs. On the optimization side, we propose an automatic multi-coordinate updating strategy for GCG (i.e., adaptively deciding how many tokens to replace in each step) to accelerate convergence, along with tricks such as easy-to-hard initialisation. We then combine these improved techniques into an efficient jailbreak method, dubbed $\mathcal{I}$-GCG. We evaluate it on a series of benchmarks (such as the NeurIPS 2023 Red Teaming Track). The results demonstrate that our improved techniques help GCG outperform state-of-the-art jailbreaking attacks and achieve a nearly 100% attack success rate. The code is released at https://github.com/jiaxiaojunQAQ/I-GCG.