Despite remarkable success in various applications, large language models (LLMs) are vulnerable to adversarial jailbreaks that render their safety guardrails void. However, previous jailbreak studies usually resort to brute-force optimization or extrapolation at a high computational cost, which might not be practical or effective. In this paper, inspired by the Milgram experiment's finding that individuals can harm another person if they are told to do so by an authoritative figure, we disclose a lightweight method, termed DeepInception, which can easily hypnotize an LLM into acting as a jailbreaker and unlock its misuse risks. Specifically, DeepInception leverages the personification ability of the LLM to construct a novel nested scene in which it behaves, which realizes an adaptive way to escape the usage control of a normal scenario and opens the possibility of further direct jailbreaks. Empirically, we conduct comprehensive experiments to show its efficacy. DeepInception achieves jailbreak success rates competitive with previous counterparts and realizes a continuous jailbreak in subsequent interactions, which reveals the critical weakness of self-losing in both open-source and closed-source LLMs such as Falcon, Vicuna, Llama-2, and GPT-3.5/4/4V. Our investigation calls for more attention to the safety aspects of LLMs and for stronger defenses against their misuse risks. The code is publicly available at: https://github.com/tmlr-group/DeepInception.