Jailbreak attacks are crucial for identifying and mitigating the security vulnerabilities of Large Language Models (LLMs): they are designed to bypass safeguards and elicit prohibited outputs. However, because jailbreak methods differ significantly from one another, the community lacks a standard implementation framework, which limits comprehensive security evaluation. This paper introduces EasyJailbreak, a unified framework that simplifies the construction and evaluation of jailbreak attacks against LLMs. It builds attacks from four components: Selector, Mutator, Constraint, and Evaluator. This modular design enables researchers to easily assemble attacks from combinations of novel and existing components. To date, EasyJailbreak supports 11 distinct jailbreak methods and facilitates security validation across a broad spectrum of LLMs. Our validation across 10 distinct LLMs reveals a significant vulnerability, with an average breach probability of 60% under various jailbreak attacks. Notably, even advanced models such as GPT-3.5-Turbo and GPT-4 exhibit average Attack Success Rates (ASR) of 57% and 33%, respectively. We have released a wealth of resources for researchers, including a web platform, a package published on PyPI, a screencast video, and experimental outputs.
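To illustrate the four-component pipeline described above, the following is a minimal sketch of how a Selector, Mutator, Constraint, and Evaluator might compose into an iterative attack loop. All class names, method signatures, and the refusal-string heuristic here are illustrative assumptions for exposition, not the actual EasyJailbreak API; consult the released package for the real interfaces.

```python
# Hypothetical sketch of the four-component attack loop; names and
# signatures are assumptions, not the real EasyJailbreak interfaces.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Instance:
    """A candidate jailbreak prompt plus the target model's response."""
    prompt: str
    response: str = ""
    score: float = 0.0


class Selector:
    """Picks the most promising candidates to mutate next."""
    def select(self, pool: List[Instance], k: int = 2) -> List[Instance]:
        return sorted(pool, key=lambda x: x.score, reverse=True)[:k]


class Mutator:
    """Rewrites a prompt to produce new attack variants."""
    def mutate(self, inst: Instance) -> List[Instance]:
        # e.g. paraphrasing, role-play wrapping, or encoding tricks
        return [Instance(prompt=f"Ignore prior rules. {inst.prompt}")]


class Constraint:
    """Filters out variants that violate attack-specific requirements."""
    def passes(self, inst: Instance) -> bool:
        return len(inst.prompt) < 2048


class Evaluator:
    """Scores whether the target model's response constitutes a breach."""
    def score(self, inst: Instance) -> float:
        # Toy refusal heuristic purely for illustration.
        refusals = ("I'm sorry", "I cannot")
        return 0.0 if inst.response.startswith(refusals) else 1.0


def run_attack(seed: str, target: Callable[[str], str],
               selector: Selector, mutator: Mutator,
               constraint: Constraint, evaluator: Evaluator,
               rounds: int = 3) -> List[Instance]:
    """Iteratively select, mutate, filter, query, and evaluate candidates."""
    pool = [Instance(prompt=seed)]
    for _ in range(rounds):
        for inst in selector.select(pool):
            for variant in mutator.mutate(inst):
                if not constraint.passes(variant):
                    continue
                variant.response = target(variant.prompt)
                variant.score = evaluator.score(variant)
                pool.append(variant)
    return pool
```

Keeping each stage behind a narrow interface like this is what would let researchers swap components freely when assembling new attack recipes from existing pieces.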