Red teaming assesses how large language models (LLMs) can produce content that violates norms, policies, and rules set during their safety training. However, most existing automated methods in the literature are not representative of the way humans tend to interact with AI models. Common users of AI models may not have advanced knowledge of adversarial machine learning methods or access to model internals, and they do not spend a lot of time crafting a single highly effective adversarial prompt. Instead, they are likely to make use of techniques commonly shared online and exploit the multiturn conversational nature of LLMs. While manual testing addresses this gap, it is an inefficient and often expensive process. To address these limitations, we introduce the Generative Offensive Agent Tester (GOAT), an automated agentic red teaming system that simulates plain language adversarial conversations while leveraging multiple adversarial prompting techniques to identify vulnerabilities in LLMs. We instantiate GOAT with 7 red teaming attacks by prompting a general-purpose model in a way that encourages reasoning through the choices of methods available, the current target model's response, and the next steps. Our approach is designed to be extensible and efficient, allowing human testers to focus on exploring new areas of risk while automation covers the scaled adversarial stress-testing of known risk territory. We present the design and evaluation of GOAT, demonstrating its effectiveness in identifying vulnerabilities in state-of-the-art LLMs, with an ASR@10 of 97% against Llama 3.1 and 88% against GPT-4 on the JailbreakBench dataset.
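The agentic loop the abstract describes, where an attacker model iteratively selects an adversarial technique, observes the target's response, and plans the next turn, can be sketched as follows. This is a minimal illustration, not the authors' implementation: the attack names, the `attacker_llm`/`target_llm`/`judge` stubs, and the turn budget are all placeholders standing in for real model calls.

```python
# Minimal sketch of a GOAT-style multiturn attacker loop (illustrative only).
# In the real system, attacker_llm is a prompted general-purpose LLM that
# reasons about which technique to apply next; here it is a trivial stub.

ATTACKS = [  # illustrative technique names, not necessarily the paper's exact 7
    "refusal_suppression", "dual_response", "response_priming",
    "persona_modification", "hypothetical", "topic_splitting",
    "opposite_intent",
]

def attacker_llm(goal, history):
    """Stub attacker: picks a technique and crafts the next adversarial turn
    conditioned on the conversation so far."""
    technique = ATTACKS[len(history) % len(ATTACKS)]
    return technique, f"[{technique}] prompt steering toward: {goal}"

def target_llm(prompt):
    """Stub target model under test; a real run would call the target's API."""
    return "I can't help with that."

def judge(response):
    """Stub judge: flags a response as a violation. A real judge would be a
    classifier or LLM grader."""
    return "I can't" not in response

def goat_attack(goal, max_turns=5):
    """Run a multiturn adversarial conversation until the judge flags a
    violation or the turn budget is exhausted."""
    history = []
    for _ in range(max_turns):
        technique, prompt = attacker_llm(goal, history)
        response = target_llm(prompt)
        history.append((technique, prompt, response))
        if judge(response):
            return True, history  # attack succeeded
    return False, history

success, transcript = goat_attack("example unsafe goal")
```

A metric like ASR@10 then corresponds to running up to 10 independent conversations per goal and counting a goal as broken if any of them returns `success=True`.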