The safety alignment of Large Language Models (LLMs) is vulnerable to both manual and automated jailbreak attacks, which adversarially induce LLMs to output harmful content. However, current jailbreaking methods nest the entire harmful prompt, so they conceal malicious intent poorly and are easily identified and rejected by well-aligned LLMs. We find that decomposing a malicious prompt into separate sub-prompts effectively obscures its underlying intent by presenting it in a fragmented, less detectable form, thereby addressing this limitation. We introduce an automatic prompt \textbf{D}ecomposition and \textbf{R}econstruction framework for jailbreak \textbf{Attack} (DrAttack). DrAttack has three key components: (a) `Decomposition' of the original prompt into sub-prompts, (b) implicit `Reconstruction' of these sub-prompts through in-context learning with a semantically similar but harmless reassembly demonstration, and (c) a `Synonym Search' over sub-prompts, which seeks synonyms that preserve the original intent while jailbreaking the LLM. An extensive empirical study across multiple open-source and closed-source LLMs demonstrates that, with significantly fewer queries, DrAttack achieves a substantially higher success rate than prior state-of-the-art prompt-only attackers. Notably, its success rate of 78.0\% on GPT-4 with only 15 queries surpasses the previous best by 33.1\%.