The safety alignment of Large Language Models (LLMs) is vulnerable to both manual and automated jailbreak attacks, which adversarially trigger LLMs to output harmful content. However, current jailbreaking methods, which nest entire harmful prompts, do not conceal malicious intent effectively and are easily identified and rejected by well-aligned LLMs. This paper finds that decomposing a malicious prompt into separate sub-prompts obscures its underlying intent by presenting it in a fragmented, less detectable form, thereby addressing these limitations. We introduce an automatic prompt \textbf{D}ecomposition and \textbf{R}econstruction framework for jailbreak \textbf{Attack} (DrAttack). DrAttack includes three key components: (a) `Decomposition' of the original prompt into sub-prompts, (b) `Reconstruction' of these sub-prompts implicitly through in-context learning with a semantically similar but harmless reassembly demo, and (c) a `Synonym Search' over sub-prompts, which looks for synonyms of the sub-prompts that maintain the original intent while jailbreaking LLMs. An extensive empirical study across multiple open-source and closed-source LLMs demonstrates that, with a significantly reduced number of queries, DrAttack obtains a substantial gain in success rate over prior SOTA prompt-only attackers. Notably, its success rate of 78.0\% on GPT-4 with merely 15 queries surpasses the previous state of the art by 33.1\%. The project is available at https://github.com/xirui-li/DrAttack.