Large language models (LLMs) have achieved remarkable success across various applications but remain susceptible to adversarial jailbreaks that void their safety guardrails. Previous attempts to exploit these vulnerabilities often rely on computationally costly procedures, which may be neither practical nor efficient. In this paper, inspired by the authority influence demonstrated in the Milgram experiment, we present a lightweight method that leverages the personification capabilities of LLMs to construct $\textit{a virtual, nested scene}$, allowing the model to adaptively escape the usage controls that govern a normal interaction. Empirically, the content induced by our approach achieves leading harmfulness rates compared with previous counterparts and sustains the jailbreak across subsequent interactions, revealing a critical self-losing weakness in both open-source and closed-source LLMs, $\textit{e.g.}$, Llama-2, Llama-3, GPT-3.5, GPT-4, and GPT-4o. The code and data are available at: https://github.com/tmlr-group/DeepInception.