Despite their remarkable success in various applications, large language models (LLMs) are vulnerable to adversarial jailbreaks that void their safety guardrails. However, previous jailbreak studies usually resort to brute-force optimization or extrapolations with high computational cost, which might not be practical or effective. In this paper, inspired by the Milgram experiment on the power of authority to incite harmful behavior, we disclose a lightweight method, termed DeepInception, which can easily hypnotize an LLM into being a jailbreaker. Specifically, DeepInception leverages the personification ability of the LLM to construct a novel nested scene in which it acts, which provides an adaptive way to escape the usage control that applies in a normal scenario. Empirically, DeepInception achieves jailbreak success rates competitive with previous counterparts and sustains the jailbreak in subsequent interactions, which reveals the critical weakness of self-losing in both open-source and closed-source LLMs such as Falcon, Vicuna-v1.5, Llama-2, and GPT-3.5-turbo/4. Our investigation calls for more attention to the safety aspects of LLMs and for stronger defenses against their misuse risks. The code is publicly available at: https://github.com/tmlr-group/DeepInception.