GPTFUZZER: Red Teaming Large Language Models with Auto-Generated Jailbreak Prompts

Large language models (LLMs) have recently become enormously popular and are used for everything from casual conversation to AI-driven programming. Despite this success, LLMs are not entirely reliable and can give detailed guidance on how to conduct harmful or illegal activities. Safety measures reduce the risk of such outputs, but adversarial jailbreak attacks can still exploit LLMs to produce harmful content. The jailbreak templates behind these attacks are typically crafted by hand, which makes large-scale testing challenging. In this paper, we introduce GPTFuzz, a novel black-box jailbreak fuzzing framework inspired by the AFL fuzzing framework. Instead of relying on manual engineering, GPTFuzz automates the generation of jailbreak templates for red-teaming LLMs. At its core, GPTFuzz starts with human-written templates as initial seeds, then mutates them to produce new templates. We detail three key components of GPTFuzz: a seed selection strategy that balances efficiency and variability, mutation operators that create semantically equivalent or similar sentences, and a judgment model that assesses whether a jailbreak attack succeeded. We evaluate GPTFuzz against various commercial and open-source LLMs, including ChatGPT, Llama-2, and Vicuna, under diverse attack scenarios. Our results indicate that GPTFuzz consistently produces jailbreak templates with a high success rate, surpassing human-crafted templates. Remarkably, GPTFuzz achieves attack success rates above 90% against ChatGPT and Llama-2 models, even with suboptimal initial seed templates. We anticipate that GPTFuzz will be instrumental for researchers and practitioners in examining LLM robustness and will encourage further exploration into enhancing LLM safety.
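The three components named above (seed selection, mutation, judgment) fit together as a loop, which can be sketched as follows. This is a minimal illustration and not the authors' implementation: the names `Seed`, `select_seed`, and `fuzz` are hypothetical, the epsilon-greedy selection rule is an assumption standing in for the paper's strategy, and keeping successful mutants in the pool is an assumption borrowed from AFL-style fuzzers. The mutation operator, target model, and judgment model are passed in as plain callables, and the ones defined here are harmless toy stand-ins.

```python
import random
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Seed:
    """A template plus simple statistics used by the selection step."""
    template: str
    visits: int = 0
    successes: int = 0


def select_seed(pool: List[Seed], rng: random.Random, explore: float = 0.3) -> Seed:
    """Epsilon-greedy choice (an assumption, not the paper's strategy):
    explore a random seed sometimes, otherwise exploit the best ratio."""
    if rng.random() < explore:
        return rng.choice(pool)
    return max(pool, key=lambda s: (s.successes + 1) / (s.visits + 1))


def fuzz(
    initial_templates: List[str],
    mutate: Callable[[str, random.Random], str],
    query_target: Callable[[str], str],
    judge: Callable[[str], bool],
    budget: int,
    rng: random.Random,
) -> List[str]:
    """Run `budget` iterations of select -> mutate -> query -> judge."""
    pool = [Seed(t) for t in initial_templates]
    found: List[str] = []
    for _ in range(budget):
        seed = select_seed(pool, rng)
        seed.visits += 1
        candidate = mutate(seed.template, rng)
        response = query_target(candidate)
        if judge(response):
            # Assumption: successful mutants rejoin the pool as new seeds.
            seed.successes += 1
            pool.append(Seed(candidate))
            found.append(candidate)
    return found


# Toy stand-ins so the loop can run without any model (all hypothetical).
def toy_mutate(template: str, rng: random.Random) -> str:
    return template + rng.choice(["a", "b"])


def toy_target(prompt: str) -> str:
    return prompt  # echoes its input instead of calling an LLM


def toy_judge(response: str) -> bool:
    return response.endswith("b")


if __name__ == "__main__":
    results = fuzz(["x"], toy_mutate, toy_target, toy_judge, 20, random.Random(0))
    print(len(results), "candidates accepted by the toy judge")
```

Passing the three components as callables keeps the loop independent of any particular model or classifier, which matches the black-box setting the abstract describes: the loop only sees responses, never model internals.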
