Although Large Language Models (LLMs) have demonstrated significant capabilities in executing complex tasks in a zero-shot manner, they remain susceptible to jailbreak attacks and can be manipulated into producing harmful outputs. Recently, a growing body of research has categorized jailbreak attacks into token-level and prompt-level attacks. However, prior work largely overlooks the diverse key factors that influence jailbreak attacks, with most studies concentrating on LLM vulnerabilities while leaving defense-enhanced LLMs underexplored. To address these gaps, we evaluate the impact of various attack settings on LLM performance and provide a baseline benchmark for jailbreak attacks, encouraging the adoption of a standardized evaluation framework. Specifically, we evaluate eight key factors in implementing jailbreak attacks on LLMs from both target-level and attack-level perspectives. We further conduct seven representative jailbreak attacks against six defense methods across two widely used datasets, encompassing approximately 354 experiments and about 55,000 GPU hours on A800-80G GPUs. Our experimental results highlight the need for standardized benchmarking to evaluate these attacks on defense-enhanced LLMs. Our code is available at https://github.com/usail-hkust/Bag_of_Tricks_for_LLM_Jailbreaking.