Despite extensive pre-training in moral alignment to prevent generating harmful information, large language models (LLMs) remain vulnerable to jailbreak attacks. In this paper, we propose AutoDefense, a multi-agent defense framework that filters harmful responses from LLMs. With its response-filtering mechanism, our framework is robust against different jailbreak attack prompts and can be used to defend different victim models. AutoDefense assigns different roles to LLM agents and employs them to complete the defense task collaboratively. This division of tasks enhances the overall instruction-following ability of LLMs and enables the integration of other defense components as tools. With AutoDefense, small open-source LMs can serve as agents and defend larger models against jailbreak attacks. Our experiments show that AutoDefense can effectively defend against different jailbreak attacks while maintaining performance on normal user requests. For example, we reduce the attack success rate on GPT-3.5 from 55.74% to 7.95% using LLaMA-2-13b with a 3-agent system. Our code and data are publicly available at https://github.com/XHMY/AutoDefense.