Misuse of Large Language Models (LLMs) has raised widespread concern. To address this issue, safeguards have been put in place to ensure that LLMs align with social ethics. However, recent findings have revealed an unsettling vulnerability that bypasses these safeguards, known as jailbreak attacks. By using techniques such as role-playing scenarios, adversarial examples, or subtle subversion of safety objectives in a prompt, an attacker can induce LLMs to produce inappropriate or even harmful responses. While researchers have studied several categories of jailbreak attacks, they have done so in isolation. To fill this gap, we present the first large-scale measurement of various jailbreak attack methods. We concentrate on 13 cutting-edge jailbreak methods from four categories, 160 questions from 16 violation categories, and six popular LLMs. Our extensive experimental results demonstrate that optimized jailbreak prompts consistently achieve the highest attack success rates and remain robust across different LLMs. Some jailbreak prompt datasets available on the Internet can also achieve high attack success rates on many LLMs, such as ChatGLM3, GPT-3.5, and PaLM2. Although many organizations claim that their policies cover these violation categories, the attack success rates for these categories remain high, indicating how challenging it is to align LLMs with their policies and to counter jailbreak attacks effectively. We also discuss the trade-off between attack performance and efficiency, and show that jailbreak prompts still transfer across models, making transfer a viable option for attacking black-box models. Overall, our research highlights the necessity of evaluating different jailbreak methods. We hope our study can provide insights for future research on jailbreak attacks and serve as a benchmark tool for practitioners to evaluate them.