Universal and Transferable Adversarial Attacks on Aligned Language Models

Because "out-of-the-box" large language models are capable of generating a great deal of objectionable content, recent work has focused on aligning these models in an attempt to prevent undesirable generation. While there has been some success at circumventing these measures -- so-called "jailbreaks" against LLMs -- these attacks have required significant human ingenuity and are brittle in practice. In this paper, we propose a simple and effective attack method that causes aligned language models to exhibit objectionable behaviors. Specifically, our approach finds a suffix that, when attached to a wide range of queries asking an LLM to produce objectionable content, aims to maximize the probability that the model produces an affirmative response (rather than refusing to answer). Instead of relying on manual engineering, our approach automatically produces these adversarial suffixes through a combination of greedy and gradient-based search techniques, and it also improves over past automatic prompt generation methods. Surprisingly, we find that the adversarial prompts generated by our approach are quite transferable, including to black-box, publicly released LLMs. Specifically, we train an adversarial attack suffix on multiple prompts (i.e., queries asking for many different types of objectionable content), as well as multiple models (in our case, Vicuna-7B and 13B). When doing so, the resulting attack suffix is able to induce objectionable content in the public interfaces to ChatGPT, Bard, and Claude, as well as open-source LLMs such as LLaMA-2-Chat, Pythia, Falcon, and others. In total, this work significantly advances the state-of-the-art in adversarial attacks against aligned language models, raising important questions about how such systems can be prevented from producing objectionable information. Code is available at github.com/llm-attacks/llm-attacks.
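
The objective described in the abstract can be written compactly as follows. This is only a sketch: the notation is assumed here for illustration and does not appear in the abstract. Let $x_i$ be the $i$-th training query, $y_i$ the affirmative response beginning targeted for that query, $s$ a suffix of $k$ tokens drawn from a vocabulary $\mathcal{V}$, $\oplus$ token concatenation, and $p_{\theta_j}$ the next-token distribution of the $j$-th model used during the search.

```latex
% Assumed notation (not from the abstract): x_i, y_i, s, k, \mathcal{V}, \theta_j, N, M.
% A single suffix s is shared across all N queries and all M models.
\[
  s^{\star} \;=\; \arg\min_{s \in \mathcal{V}^{k}}
  \sum_{j=1}^{M} \sum_{i=1}^{N}
  -\log p_{\theta_j}\!\left( y_i \,\middle|\, x_i \oplus s \right)
\]
```

Minimizing the summed negative log-likelihood is equivalent to maximizing the probability of an affirmative response across every query and model at once, which is the sense in which the suffix is "universal". Because $s$ ranges over discrete tokens, the minimization cannot be done by ordinary gradient descent, which is presumably why the abstract describes a combination of greedy and gradient-based search.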
