Despite their remarkable success in various applications, large language models (LLMs) are vulnerable to adversarial jailbreaks that void their safety guardrails. However, previous jailbreak studies usually resort to brute-force optimization or extrapolations with a high computational cost, which might not be practical or effective. In this paper, inspired by the Milgram experiment, in which individuals were willing to harm another person when told to do so by an authority figure, we propose a lightweight method, termed DeepInception, which can easily hypnotize an LLM into acting as a jailbreaker and unlock its misuse risks. Specifically, DeepInception leverages the personification ability of the LLM to construct a novel nested scene in which it acts, which provides an adaptive way to escape the usage control that applies in a normal scenario and opens the possibility of further direct jailbreaks. Empirically, we conduct comprehensive experiments to show its efficacy. DeepInception achieves jailbreak success rates competitive with previous counterparts and realizes a continuous jailbreak in subsequent interactions, which reveals the critical weakness of self-losing in both open-source and closed-source LLMs such as Falcon, Vicuna, Llama-2, and GPT-3.5/4/4V. Our investigation calls for more attention to the safety aspects of LLMs and for stronger defenses against their misuse risks. The code is publicly available at: https://github.com/tmlr-group/DeepInception.