AutoDAN: Interpretable Gradient-Based Adversarial Attacks on Large Language Models

From arXiv. Version 2 updates: added a comparison of three more evaluation methods and a reliability check of them using human labeling; added results for jailbreaking Llama2 (individual behavior); added complexity and hyperparameter analysis; revised the objectives for prompt leaking; other minor changes.

Safety alignment of Large Language Models (LLMs) can be compromised with manual jailbreak attacks and (automatic) adversarial attacks. Recent studies suggest that defending against these attacks is possible: adversarial attacks generate unlimited but unreadable gibberish prompts, detectable by perplexity-based filters; manual jailbreak attacks craft readable prompts, but their limited number due to the necessity of human creativity allows for easy blocking. In this paper, we show that these solutions may be too optimistic. We introduce AutoDAN, an interpretable, gradient-based adversarial attack that merges the strengths of both attack types. Guided by the dual goals of jailbreak and readability, AutoDAN optimizes and generates tokens one by one from left to right, resulting in readable prompts that bypass perplexity filters while maintaining high attack success rates. Notably, these prompts, generated from scratch using gradients, are interpretable and diverse, with emerging strategies commonly seen in manual jailbreak attacks. They also generalize to unforeseen harmful behaviors and transfer to black-box LLMs better than their unreadable counterparts when using limited training data or a single proxy model. Furthermore, we show the versatility of AutoDAN by automatically leaking system prompts using a customized objective. Our work offers a new way to red-team LLMs and understand jailbreak mechanisms via interpretability.
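The abstract refers to perplexity-based filters as the defense that catches unreadable adversarial prompts. As a minimal sketch of that idea (not the paper's implementation), the snippet below computes perplexity from per-token natural-log probabilities and rejects prompts above a threshold. The function names, the threshold value, and the log-probability values are all illustrative assumptions; in practice the log-probabilities would come from a language model scoring the prompt.

```python
import math
from typing import Sequence


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Perplexity = exp of the negative mean per-token natural-log probability."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def passes_filter(token_logprobs: Sequence[float], threshold: float) -> bool:
    """Return True if the prompt's perplexity does not exceed the threshold."""
    return perplexity(token_logprobs) <= threshold


# Hypothetical scores: a fluent prompt gets fairly likely tokens,
# a gibberish suffix gets very unlikely ones.
readable = [math.log(0.25)] * 8    # each token has probability 0.25
gibberish = [math.log(0.001)] * 8  # each token has probability 0.001
THRESHOLD = 50.0                   # assumed cutoff, chosen for illustration only

print(passes_filter(readable, THRESHOLD))
print(passes_filter(gibberish, THRESHOLD))
```

Under this kind of filter, a prompt is flagged purely by how unlikely its tokens are, which is why readable prompts with low perplexity are not caught by it.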
