Safety alignment of Large Language Models (LLMs) can be compromised by manual jailbreak attacks and (automatic) adversarial attacks. Recent work suggests that patching LLMs against these attacks is possible: manual jailbreak attacks are human-readable but often limited in number and publicly known, making them easy to block, while adversarial attacks generate gibberish prompts that can be detected with perplexity-based filters. In this paper, we show that these solutions may be too optimistic. We propose an interpretable adversarial attack, \texttt{AutoDAN}, that combines the strengths of both types of attacks. It automatically generates attack prompts that bypass perplexity-based filters while maintaining a high attack success rate comparable to that of manual jailbreak attacks. These prompts are interpretable and diverse, exhibiting strategies commonly used in manual jailbreak attacks, and they transfer better than their non-readable counterparts when using limited training data or a single proxy model. We also customize \texttt{AutoDAN}'s objective to leak system prompts, another jailbreak application not addressed in the adversarial attack literature. Our work provides a new way to red-team LLMs and to understand the mechanism of jailbreak attacks.