Rethinking How to Evaluate Language Model Jailbreak

Large language models (LLMs) are increasingly integrated into a wide range of applications. To ensure that LLMs do not generate unsafe responses, they are aligned with safeguards that specify what content is restricted. However, this alignment can be bypassed to produce prohibited content using a technique commonly referred to as jailbreak. Several systems have been proposed to perform jailbreaks automatically, and these systems rely on evaluation methods to determine whether a jailbreak attempt is successful. Our analysis reveals that current jailbreak evaluation methods have two limitations: (1) their objectives lack clarity and do not align with the goal of identifying unsafe responses, and (2) they oversimplify the jailbreak result as a binary outcome, successful or not. In this paper, we propose three metrics, safeguard violation, informativeness, and relative truthfulness, to evaluate language model jailbreak. We also demonstrate how these metrics correlate with the goals of different malicious actors. To compute these metrics, we introduce a multifaceted approach that extends the natural language generation evaluation method after preprocessing the response. We evaluate our metrics on a benchmark dataset produced from three malicious intent datasets and three jailbreak systems and labeled by three annotators. We compare our multifaceted approach with three existing jailbreak evaluation methods. Experiments demonstrate that our multifaceted evaluation outperforms these baselines, improving F1 scores by 17% on average. Our findings motivate moving away from the binary view of the jailbreak problem and incorporating a more comprehensive evaluation to ensure the safety of language models.
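
To make the contrast with a single success/failure flag concrete, the following is a minimal sketch of how a multi-metric evaluation result could be represented and scored against annotator labels. It is not the paper's implementation: the class name `JailbreakEvaluation`, the helper names `f1_score` and `per_metric_f1`, and the assumption that each of the three metrics is recorded as a yes/no judgment per response are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence

# Metric names taken from the abstract; treating each one as a boolean
# judgment per response is an assumption of this sketch.
METRICS = ("safeguard_violation", "informativeness", "relative_truthfulness")


@dataclass(frozen=True)
class JailbreakEvaluation:
    """Hypothetical record: one judgment per metric instead of one flag."""
    safeguard_violation: bool
    informativeness: bool
    relative_truthfulness: bool


def f1_score(predicted: Sequence[bool], actual: Sequence[bool]) -> float:
    """Standard F1: harmonic mean of precision and recall for the True class."""
    tp = sum(p and a for p, a in zip(predicted, actual))
    fp = sum(p and not a for p, a in zip(predicted, actual))
    fn = sum(a and not p for p, a in zip(predicted, actual))
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def per_metric_f1(
    predictions: List[JailbreakEvaluation],
    labels: List[JailbreakEvaluation],
) -> Dict[str, float]:
    """Score an evaluator against annotator labels, one F1 per metric."""
    return {
        name: f1_score(
            [getattr(p, name) for p in predictions],
            [getattr(l, name) for l in labels],
        )
        for name in METRICS
    }
```

Keeping the three judgments separate lets an evaluator be compared with annotators metric by metric, rather than collapsing every response into "jailbroken" or "not jailbroken".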