Large language models (LLMs) have recently experienced tremendous popularity and are widely used in applications ranging from casual conversation to AI-driven programming. However, despite their considerable success, LLMs are not entirely reliable and can give detailed guidance on how to conduct harmful or illegal activities. While safety measures can reduce the risk of such outputs, adversarial jailbreak attacks can still exploit LLMs to produce harmful content. These jailbreak templates are typically crafted manually, making large-scale testing challenging. In this paper, we introduce GPTFuzz, a novel black-box jailbreak fuzzing framework inspired by the AFL fuzzing framework. Instead of manual engineering, GPTFuzz automates the generation of jailbreak templates for red-teaming LLMs. At its core, GPTFuzz starts with human-written templates as initial seeds, then mutates them to produce new templates. We detail three key components of GPTFuzz: a seed selection strategy that balances efficiency and variability, mutation operators that create semantically equivalent or similar sentences, and a judgment model that assesses the success of a jailbreak attack. We evaluate GPTFuzz against various commercial and open-source LLMs, including ChatGPT, Llama-2, and Vicuna, under diverse attack scenarios. Our results indicate that GPTFuzz consistently produces jailbreak templates with a high success rate, surpassing human-crafted templates. Remarkably, GPTFuzz achieves attack success rates of over 90% against ChatGPT and Llama-2 models, even with suboptimal initial seed templates. We anticipate that GPTFuzz will be instrumental for researchers and practitioners in examining LLM robustness and will encourage further exploration into enhancing LLM safety.
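The three components above (seed selection, mutation, judgment) form an AFL-style fuzzing loop. The following is a minimal sketch of that loop, not the paper's implementation: `mutate`, `query_llm`, and `judge` are hypothetical stand-ins for what are, in the real system, LLM-driven mutation operators, the target model under test, and a trained judgment classifier.

```python
import random

def mutate(template: str) -> str:
    # Placeholder mutation operator. The paper's operators use an LLM to
    # rewrite a template into a semantically equivalent or similar one;
    # here we only append a marker to keep the example self-contained.
    return template + " (rephrased)"

def query_llm(prompt: str) -> str:
    # Stub target model. A real run would send the prompt to the LLM
    # under test (e.g., ChatGPT, Llama-2, Vicuna) and return its reply.
    return "Sure, here is" if "(rephrased)" in prompt else "I cannot help with that."

def judge(response: str) -> bool:
    # Stub judgment model. The paper trains a classifier to decide
    # whether a jailbreak succeeded; this naive refusal-prefix check is
    # for illustration only.
    return not response.startswith("I cannot")

def fuzz(seeds, question, iterations=10):
    # Seed pool, initialized from human-written jailbreak templates.
    pool = list(seeds)
    successes = []
    for _ in range(iterations):
        # Seed selection: random here; the real strategy balances
        # exploration of diverse seeds against exploiting effective ones.
        seed = random.choice(pool)
        candidate = mutate(seed)
        response = query_llm(candidate.replace("[QUESTION]", question))
        if judge(response):
            # Successful templates re-enter the pool for further mutation.
            pool.append(candidate)
            successes.append(candidate)
    return successes
```

With the stubs above, every mutated template "succeeds", so the loop simply accumulates candidates; the structure, not the stubs, is the point: a feedback loop where effective templates are retained and further mutated.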