As the integration of Large Language Models (LLMs) into various applications increases, so does their susceptibility to misuse, raising significant security concerns. Numerous jailbreak attacks have been proposed to assess the security defenses of LLMs. Current jailbreak attacks mainly rely on scenario camouflage, prompt obfuscation, prompt optimization, and iterative prompt optimization to conceal malicious prompts. In particular, a sequential chain of prompts in a single query can lead LLMs to focus on certain prompts while ignoring others, facilitating context manipulation. This paper introduces SequentialBreak, a novel jailbreak attack that exploits this vulnerability. We discuss several scenarios, including but not limited to Question Bank, Dialog Completion, and Game Environment, in which a harmful prompt is embedded among benign ones to fool LLMs into generating harmful responses. The distinct narrative structures of these scenarios show that SequentialBreak is flexible enough to adapt to various prompt formats beyond those discussed. Extensive experiments demonstrate that SequentialBreak needs only a single query to achieve a substantial gain in attack success rate over existing baselines against both open-source and closed-source models. Through our research, we highlight the urgent need for more robust and resilient safeguards to enhance LLM security and prevent potential misuse. All result files and the website associated with this research are available in this GitHub repository: https://anonymous.4open.science/r/JailBreakAttack-4F3B/.