Large language models (LLMs) have recently experienced tremendous popularity and are widely used in applications ranging from casual conversation to AI-driven programming. However, despite their considerable success, LLMs are not entirely reliable and can give detailed guidance on how to conduct harmful or illegal activities. While safety measures can reduce the risk of such outputs, adversarial jailbreak attacks can still exploit LLMs to produce harmful content. These jailbreak templates are typically crafted manually, making large-scale testing challenging. In this paper, we introduce GPTFuzz, a novel black-box jailbreak fuzzing framework inspired by the AFL fuzzing framework. Instead of manual engineering, GPTFuzz automates the generation of jailbreak templates for red-teaming LLMs. At its core, GPTFuzz starts with human-written templates as initial seeds, then mutates them to produce new templates. We detail three key components of GPTFuzz: a seed selection strategy that balances efficiency and variability, mutation operators that create semantically equivalent or similar sentences, and a judgment model that assesses the success of a jailbreak attack. We evaluate GPTFuzz against various commercial and open-source LLMs, including ChatGPT, Llama-2, and Vicuna, under diverse attack scenarios. Our results indicate that GPTFuzz consistently produces jailbreak templates with a high success rate, surpassing human-crafted templates. Remarkably, GPTFuzz achieves attack success rates of over 90% against ChatGPT and Llama-2 models, even with suboptimal initial seed templates. We anticipate that GPTFuzz will be instrumental for researchers and practitioners in examining LLM robustness and will encourage further exploration into enhancing LLM safety.
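The three components above (seed selection, mutation, judgment) form an AFL-style fuzzing loop. The following is a minimal sketch of that loop, not the paper's implementation: `mutate`, `query_llm`, and `judge` are hypothetical stand-ins for what are, in the real system, LLM-driven mutation operators, the target model under test, and a trained judgment classifier.

```python
import random

def mutate(template: str) -> str:
    # Placeholder mutation operator. The paper's operators use an LLM to
    # rewrite a template into a semantically equivalent or similar one;
    # here we only append a marker to keep the example self-contained.
    return template + " (rephrased)"

def query_llm(prompt: str) -> str:
    # Stub target model. A real run would send the prompt to the LLM
    # under test (e.g., ChatGPT, Llama-2, Vicuna) and return its reply.
    return "Sure, here is" if "(rephrased)" in prompt else "I cannot help with that."

def judge(response: str) -> bool:
    # Stub judgment model. The paper trains a classifier to decide
    # whether a jailbreak succeeded; this naive refusal-prefix check is
    # for illustration only.
    return not response.startswith("I cannot")

def fuzz(seeds, question, iterations=10):
    # Seed pool, initialized from human-written jailbreak templates.
    pool = list(seeds)
    successes = []
    for _ in range(iterations):
        # Seed selection: random here; the real strategy balances
        # exploration of diverse seeds against exploiting effective ones.
        seed = random.choice(pool)
        candidate = mutate(seed)
        response = query_llm(candidate.replace("[QUESTION]", question))
        if judge(response):
            # Successful templates re-enter the pool for further mutation.
            pool.append(candidate)
            successes.append(candidate)
    return successes
```

With the stubs above, every mutated template "succeeds", so the loop simply accumulates candidates; the structure, not the stubs, is the point: a feedback loop where effective templates are retained and further mutated.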