Despite recent advancements in Large Language Models (LLMs) and their alignment, they can still be jailbroken, i.e., harmful and toxic content can be elicited from them. While existing red-teaming methods have shown promise in uncovering such vulnerabilities, they suffer from limited success rates and high computational and monetary costs. To address this, we propose a black-box Jailbreak method with Cross-Behavior attacks (JCB) that can automatically and efficiently find successful jailbreak prompts. JCB leverages successes from past behaviors to help jailbreak new behaviors, significantly improving attack efficiency. Moreover, JCB does not rely on time- and cost-intensive calls to auxiliary LLMs to discover or optimize jailbreak prompts, making it highly efficient and scalable. Comprehensive experimental evaluations show that JCB significantly outperforms related baselines, requiring up to 94% fewer queries while achieving a 12.9% higher average attack success rate. JCB also achieves a notably high 37% attack success rate on Llama-2-7B, one of the most resilient LLMs, and shows promising zero-shot transferability across different LLMs.
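To make the cross-behavior idea concrete, the sketch below shows one plausible way such an attack loop could be organized: jailbreak templates that succeeded on earlier behaviors are tried first on each new behavior, and fresh successes are fed back into a shared pool. This is only an illustration of the description above, not the paper's actual method; query_model, is_jailbroken, mutate, and seed_templates are hypothetical placeholders supplied by the caller.

```python
import random

def cross_behavior_attack(behaviors, seed_templates, query_model,
                          is_jailbroken, mutate, budget=50):
    """Attack each behavior in turn, reusing templates that already
    succeeded on earlier behaviors before exploring mutated variants.

    All callables are hypothetical stand-ins: query_model(prompt) returns
    the target LLM's response, is_jailbroken(behavior, response) judges
    success, and mutate(template) produces a perturbed template.
    """
    pool = list(seed_templates)  # successful templates, shared across behaviors
    results = {}
    for behavior in behaviors:
        results[behavior] = None
        # Proven templates first, then mutated variants of them.
        variants = ([mutate(random.choice(pool)) for _ in range(budget)]
                    if pool else [])
        for template in (pool + variants)[:budget]:
            prompt = template.format(behavior=behavior)
            response = query_model(prompt)        # one black-box query
            if is_jailbroken(behavior, response):
                if template not in pool:
                    pool.append(template)         # feed the success forward
                results[behavior] = prompt
                break
    return results
```

Under this organization, the per-behavior query count should drop as the pool grows, since later behaviors are likely to be jailbroken by an already-proven template before any new variants are explored, which is consistent with the efficiency gains the abstract reports.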