Recently, advanced Large Language Models (LLMs) such as GPT-4 have been integrated into many real-world applications like Code Copilot. These applications have significantly expanded the attack surface of LLMs, exposing them to a variety of threats. Among them, jailbreak attacks that induce toxic responses through jailbreak prompts have raised critical safety concerns. To identify these threats, a growing number of red teaming approaches simulate potential adversarial scenarios by crafting jailbreak prompts to test the target LLM. However, existing red teaming methods do not consider the unique vulnerabilities of LLMs in different scenarios, making it difficult to adjust the jailbreak prompts to find context-specific vulnerabilities. Meanwhile, these methods are limited to refining jailbreak templates with a few mutation operations, lacking the automation and scalability needed to adapt to different scenarios. To enable context-aware and efficient red teaming, we abstract and model existing attacks into a coherent concept called a "jailbreak strategy" and propose a multi-agent LLM system named RedAgent that leverages these strategies to generate context-aware jailbreak prompts. By self-reflecting on contextual feedback stored in an additional memory buffer, RedAgent continuously learns how to apply these strategies to achieve effective jailbreaks in specific contexts. Extensive experiments demonstrate that our system can jailbreak most black-box LLMs in just five queries, making it twice as efficient as existing red teaming methods. Additionally, RedAgent can jailbreak customized LLM applications even more efficiently. By generating context-aware jailbreak prompts against applications on GPTs, we discover 60 severe vulnerabilities in these real-world applications with only two queries per vulnerability. We have reported all discovered issues and are communicating with OpenAI and Meta about bug fixes.