Although Large Language Models (LLMs) have demonstrated significant capabilities in executing complex tasks in a zero-shot manner, they are susceptible to jailbreak attacks and can be manipulated to produce harmful outputs. Recently, a growing body of research has categorized jailbreak attacks into token-level and prompt-level attacks. However, prior work largely overlooks the diverse key factors of jailbreak attacks, with most studies concentrating on LLM vulnerabilities and lacking exploration of defense-enhanced LLMs. To address these issues, we introduce $\textbf{JailTrickBench}$ to evaluate the impact of various attack settings on LLM performance and to provide a baseline for jailbreak attacks, encouraging the adoption of a standardized evaluation framework. Specifically, we evaluate eight key factors of implementing jailbreak attacks on LLMs from both target-level and attack-level perspectives. We further conduct seven representative jailbreak attacks against six defense methods across two widely used datasets, encompassing approximately 354 experiments with about 55,000 GPU hours on A800-80G GPUs. Our experimental results highlight the need for standardized benchmarking to evaluate these attacks on defense-enhanced LLMs. Our code is available at https://github.com/usail-hkust/JailTrickBench.