JailbreakEval: An Integrated Toolkit for Evaluating Jailbreak Attempts Against Large Language Models

Jailbreak attacks aim to induce Large Language Models (LLMs) to generate harmful responses for forbidden instructions, presenting severe misuse threats to LLMs. Up to now, research into jailbreak attacks and defenses is emerging, however, there is (surprisingly) no consensus on how to evaluate whether a jailbreak attempt is successful. In other words, the methods to assess the harmfulness of an LLM's response are varied, such as manual annotation or prompting GPT-4 in specific ways. Each approach has its own set of strengths and weaknesses, impacting their alignment with human values, as well as the time and financial cost. This diversity in evaluation presents challenges for researchers in choosing suitable evaluation methods and conducting fair comparisons across different jailbreak attacks and defenses. In this paper, we conduct a comprehensive analysis of jailbreak evaluation methodologies, drawing from nearly ninety jailbreak research released between May 2023 and April 2024. Our study introduces a systematic taxonomy of jailbreak evaluators, offering in-depth insights into their strengths and weaknesses, along with the current status of their adaptation. Moreover, to facilitate subsequent research, we propose JailbreakEval, a user-friendly toolkit focusing on the evaluation of jailbreak attempts. It includes various well-known evaluators out-of-the-box, so that users can obtain evaluation results with only a single command. JailbreakEval also allows users to customize their own evaluation workflow in a unified framework with the ease of development and comparison. In summary, we regard JailbreakEval to be a catalyst that simplifies the evaluation process in jailbreak research and fosters an inclusive standard for jailbreak evaluation within the community.

翻译：越狱攻击旨在诱导大型语言模型（LLM）针对被禁止的指令生成有害响应，这对LLM构成了严重的滥用威胁。迄今为止，针对越狱攻击与防御的研究不断涌现，然而（令人惊讶的是）关于如何评估越狱尝试是否成功尚未形成共识。换言之，评估LLM响应有害性的方法多种多样，例如人工标注或以特定方式提示GPT-4。每种方法都有其独特的优势与局限性，影响着其与人类价值观的契合度，以及时间和经济成本。这种评估方式的多样性给研究人员选择合适的评估方法并进行跨不同越狱攻击与防御的公平比较带来了挑战。本文基于2023年5月至2024年4月期间发布的近九十项越狱研究，对越狱评估方法进行了全面分析。我们的研究提出了越狱评估器的系统分类法，深入剖析了其优势与不足以及当前适配现状。此外，为促进后续研究，我们提出了JailbreakEval——一个专注于评估越狱尝试的易用工具包。该工具包开箱即用集成多种知名评估器，用户仅需单条命令即可获得评估结果。JailbreakEval还允许用户在统一框架中自定义评估流程，兼具开发便捷性与对比便利性。总而言之，我们认为JailbreakEval能够作为催化剂，简化越狱研究中的评估流程，并在社区内推动建立包容性的越狱评估标准。