Jailbreaking is an emerging adversarial attack that bypasses the safety alignment deployed in off-the-shelf large language models (LLMs). A considerable amount of research exists proposing more effective jailbreak attacks, including the recent Greedy Coordinate Gradient (GCG) attack, jailbreak template-based attacks such as using "Do-Anything-Now" (DAN), and multilingual jailbreak. In contrast, the defensive side has been relatively less explored. This paper proposes a lightweight yet practical defense called SELFDEFEND, which can defend against all existing jailbreak attacks with minimal delay for jailbreak prompts and negligible delay for normal user prompts. Our key insight is that regardless of the kind of jailbreak strategies employed, they eventually need to include a harmful prompt (e.g., "how to make a bomb") in the prompt sent to LLMs, and we found that existing LLMs can effectively recognize such harmful prompts that violate their safety policies. Based on this insight, we design a shadow stack that concurrently checks whether a harmful prompt exists in the user prompt and triggers a checkpoint in the normal stack once a token of "No" or a harmful prompt is output. The latter could also generate an explainable LLM response to adversarial prompts. We demonstrate our idea of SELFDEFEND works in various jailbreak scenarios through manual analysis in GPT-3.5/4. We also list three future directions to further enhance SELFDEFEND.
翻译:越狱是一种新兴的对抗性攻击方式,可绕过商用大语言模型中部署的安全对齐机制。现有大量研究提出了更有效的越狱攻击方法,包括最近的贪婪坐标梯度攻击、基于越狱模板的攻击(如使用"Do-Anything-Now")以及多语言越狱等。相比之下,防御端的研究相对较少。本文提出一种轻量级且实用的防御方法SELFDEFEND,能抵御所有现有越狱攻击,对越狱提示的延迟极小且对正常用户提示的延迟可忽略不计。我们的核心见解在于:无论采用何种越狱策略,最终都需要在向大语言模型发送的提示中包含有害提示(例如"如何制造炸弹"),而现有大语言模型能有效识别这些违反其安全策略的有害提示。基于这一见解,我们设计了影子堆栈机制,可同时检测用户提示中是否存在有害提示,并在输出"不"标记或有害提示时触发正常堆栈中的检查点。该机制还能为对抗性提示生成可解释的大语言模型响应。通过在GPT-3.5/4上的人工分析,我们验证了SELFDEFEND理念在多种越狱场景中的有效性。最后,我们提出三个未来方向以进一步优化SELFDEFEND。