Jailbreak vulnerabilities in Large Language Models (LLMs), in which meticulously crafted prompts elicit content that violates service guidelines, have drawn the attention of the research community. While model owners can defend against individual jailbreak prompts through safety training, this relatively passive approach struggles to handle the broader category of similar jailbreaks. To tackle this issue, we introduce FuzzLLM, an automated fuzzing framework designed to proactively test for and discover jailbreak vulnerabilities in LLMs. We use templates to capture the structural integrity of a prompt and isolate the key features of a jailbreak class as constraints. By integrating different base classes into powerful combo attacks and varying the elements of constraints and prohibited questions, FuzzLLM enables efficient testing with reduced manual effort. Extensive experiments demonstrate FuzzLLM's effectiveness and comprehensiveness in vulnerability discovery across various LLMs.
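The template-and-constraint idea described in the abstract can be illustrated with a minimal sketch. This is not FuzzLLM's actual implementation: the template string, the placeholder names `{constraint}` and `{question}`, and the helpers `combo_constraints` and `generate_prompts` are all hypothetical, and the constraint and question entries are inert placeholders. The sketch only shows how fixing a template's structure while varying its slots yields many test prompts from a few components, and how constraints from two base classes might be merged into a combo.

```python
from itertools import product

# Hypothetical template: the fixed text carries the prompt structure,
# while the placeholders mark the slots that the fuzzer varies.
TEMPLATE = "{constraint}\n\nQuestion: {question}"

# Placeholder constraint sets for two hypothetical base classes.
CLASS_A = ["<class A constraint 1>", "<class A constraint 2>"]
CLASS_B = ["<class B constraint 1>"]

# Placeholder test questions (inert strings for illustration).
QUESTIONS = ["<test question 1>", "<test question 2>", "<test question 3>"]


def combo_constraints(*classes):
    """Merge one constraint from each base class into a combined constraint."""
    return [" ".join(parts) for parts in product(*classes)]


def generate_prompts(template, constraints, questions):
    """Instantiate the template for every (constraint, question) pair."""
    return [
        template.format(constraint=c, question=q)
        for c, q in product(constraints, questions)
    ]


# Single-class prompts: 2 constraints x 3 questions.
base_prompts = generate_prompts(TEMPLATE, CLASS_A, QUESTIONS)

# Combo prompts: (2 x 1) merged constraints x 3 questions.
combo_prompts = generate_prompts(
    TEMPLATE, combo_constraints(CLASS_A, CLASS_B), QUESTIONS
)
```

Because the number of generated prompts grows as the product of the component list sizes, a small set of manually written components can cover a large test space, which is one way to read the abstract's claim of reduced manual effort.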