Aligned Large Language Models (LLMs) are powerful language understanding and decision-making tools created through extensive alignment with human feedback. However, these models remain susceptible to jailbreak attacks, in which adversaries manipulate prompts to elicit malicious outputs that aligned LLMs should not produce. Investigating jailbreak prompts helps expose the limitations of LLMs and guides efforts to secure them. Unfortunately, existing jailbreak techniques suffer from either (1) scalability issues, where attacks rely heavily on manually crafted prompts, or (2) stealthiness problems, where attacks depend on token-based algorithms that generate prompts which are often semantically meaningless and therefore easy to detect with basic perplexity testing. In light of these challenges, we ask: can we develop an approach that automatically generates stealthy jailbreak prompts? In this paper, we introduce AutoDAN, a novel jailbreak attack against aligned LLMs. AutoDAN automatically generates stealthy jailbreak prompts using a carefully designed hierarchical genetic algorithm. Extensive evaluations demonstrate that AutoDAN not only automates the process while preserving semantic meaningfulness, but also shows superior attack strength in cross-model transferability and cross-sample universality compared with the baseline. Moreover, we evaluate AutoDAN against perplexity-based defense methods and show that it can bypass them effectively.
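To make the notion of "basic perplexity testing" concrete, the following is a minimal sketch of a perplexity filter, written from the defender's side. It is not the defense evaluated in the paper: a toy character-bigram model with add-one smoothing stands in for the language model a real filter would use, and the names `train_char_bigram`, `perplexity`, and `is_flagged`, the reference text, and the threshold are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def train_char_bigram(corpus: str):
    """Count character bigrams and their left contexts in a reference text.

    Toy stand-in (assumption) for the language model a real filter would use.
    """
    bigrams = Counter(zip(corpus, corpus[1:]))
    contexts = Counter(corpus[:-1])
    vocab_size = len(set(corpus)) + 1  # +1 reserves mass for unseen characters
    return bigrams, contexts, vocab_size


def perplexity(text: str, model) -> float:
    """Per-character perplexity of `text` under the add-one-smoothed bigram model."""
    bigrams, contexts, vocab_size = model
    pairs = list(zip(text, text[1:]))
    if not pairs:
        raise ValueError("text must contain at least two characters")
    log_prob = 0.0
    for prev, cur in pairs:
        # Add-one (Laplace) smoothing keeps unseen bigrams at non-zero probability.
        p = (bigrams[(prev, cur)] + 1) / (contexts[prev] + vocab_size)
        log_prob += math.log(p)
    return math.exp(-log_prob / len(pairs))


def is_flagged(text: str, model, threshold: float) -> bool:
    """Flag inputs whose perplexity exceeds a threshold (hypothetical value)."""
    return perplexity(text, model) > threshold


# Hypothetical reference text; a real filter would rely on a trained LLM instead.
reference = "the cat sat on the mat. " * 20
model = train_char_bigram(reference)

fluent = "the cat sat on the mat."
gibberish = "xq#zv@kj!!"  # resembles a semantically meaningless token-level suffix

print(perplexity(fluent, model) < perplexity(gibberish, model))
print(is_flagged(fluent, model, threshold=6.0))
print(is_flagged(gibberish, model, threshold=6.0))
```

The sketch only shows why semantically meaningless strings stand out under such a test: text that resembles the reference distribution scores low, while character sequences the model has never seen score high and cross the threshold.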