The widespread application of large language models (LLMs) has raised concerns about their potential misuse. Although aligned with human preference data before release, LLMs remain vulnerable to various malicious attacks. In this paper, we adopt a red-teaming strategy to enhance LLM safety and introduce SoP, a simple yet effective framework for automatically designing jailbreak prompts. Inspired by the concept of social facilitation, SoP generates and optimizes multiple jailbreak characters to bypass the guardrails of the target LLM. Unlike previous work, which relies on proprietary LLMs or seed jailbreak templates crafted with human expertise, SoP can generate and optimize jailbreak prompts in a cold-start scenario using open-sourced LLMs, without any seed jailbreak templates. Experimental results show that SoP achieves attack success rates of 88% and 60% in bypassing the safety alignment of GPT-3.5-1106 and GPT-4, respectively. Furthermore, we extensively evaluate the transferability of the generated templates across different LLMs and held-out malicious requests, and we explore defense strategies against the jailbreak attacks designed by SoP. Code is available at https://github.com/Yang-Yan-Yang-Yan/SoP.