Red teaming assesses how large language models (LLMs) can produce content that violates norms, policies, and rules set during their safety training. However, most existing automated methods in the literature are not representative of the way humans tend to interact with AI models. Common users of AI models may not have advanced knowledge of adversarial machine learning methods or access to model internals, and they do not spend a lot of time crafting a single highly effective adversarial prompt. Instead, they are likely to make use of techniques commonly shared online and exploit the multiturn conversational nature of LLMs. While manual testing addresses this gap, it is an inefficient and often expensive process. To address these limitations, we introduce the Generative Offensive Agent Tester (GOAT), an automated agentic red teaming system that simulates plain language adversarial conversations while leveraging multiple adversarial prompting techniques to identify vulnerabilities in LLMs. We instantiate GOAT with 7 red teaming attacks by prompting a general-purpose model in a way that encourages reasoning through the choices of methods available, the current target model's response, and the next steps. Our approach is designed to be extensible and efficient, allowing human testers to focus on exploring new areas of risk while automation covers the scaled adversarial stress-testing of known risk territory. We present the design and evaluation of GOAT, demonstrating its effectiveness in identifying vulnerabilities in state-of-the-art LLMs, with an ASR@10 of 97% against Llama 3.1 and 88% against GPT-4 on the JailbreakBench dataset.
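The agentic loop the abstract describes, where an attacker model iteratively selects an adversarial technique, observes the target's response, and plans the next turn, can be sketched as follows. This is a minimal illustration, not the authors' implementation: the attack names, the `attacker_llm`/`target_llm`/`judge` stubs, and the turn budget are all placeholders standing in for real model calls.

```python
# Minimal sketch of a GOAT-style multiturn attacker loop (illustrative only).
# In the real system, attacker_llm is a prompted general-purpose LLM that
# reasons about which technique to apply next; here it is a trivial stub.

ATTACKS = [  # illustrative technique names, not necessarily the paper's exact 7
    "refusal_suppression", "dual_response", "response_priming",
    "persona_modification", "hypothetical", "topic_splitting",
    "opposite_intent",
]

def attacker_llm(goal, history):
    """Stub attacker: picks a technique and crafts the next adversarial turn
    conditioned on the conversation so far."""
    technique = ATTACKS[len(history) % len(ATTACKS)]
    return technique, f"[{technique}] prompt steering toward: {goal}"

def target_llm(prompt):
    """Stub target model under test; a real run would call the target's API."""
    return "I can't help with that."

def judge(response):
    """Stub judge: flags a response as a violation. A real judge would be a
    classifier or LLM grader."""
    return "I can't" not in response

def goat_attack(goal, max_turns=5):
    """Run a multiturn adversarial conversation until the judge flags a
    violation or the turn budget is exhausted."""
    history = []
    for _ in range(max_turns):
        technique, prompt = attacker_llm(goal, history)
        response = target_llm(prompt)
        history.append((technique, prompt, response))
        if judge(response):
            return True, history  # attack succeeded
    return False, history

success, transcript = goat_attack("example unsafe goal")
```

A metric like ASR@10 then corresponds to running up to 10 independent conversations per goal and counting a goal as broken if any of them returns `success=True`.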