The rapid development of Large Language Models (LLMs) has brought significant advancements across various tasks. Despite these achievements, LLMs still exhibit inherent safety vulnerabilities, especially when confronted with jailbreak attacks. Existing jailbreak methods suffer from two main limitations: reliance on complicated prompt engineering and on iterative optimization, which lead to low attack success rates (ASR) and poor attack efficiency (AE). In this work, we propose an efficient jailbreak attack method, Analyzing-based Jailbreak (ABJ), which leverages the advanced reasoning capabilities of LLMs to autonomously generate harmful content, revealing the safety vulnerabilities that surface during their complex reasoning processes. We conduct comprehensive experiments on ABJ across various open-source and closed-source LLMs. ABJ achieves a high ASR (82.1% on GPT-4o-2024-11-20) with exceptional AE on all target LLMs, demonstrating remarkable attack effectiveness, transferability, and efficiency. Our findings underscore the urgent need to prioritize and improve the safety of LLMs to mitigate the risk of misuse.