While significant attention has been devoted to exploiting weaknesses in LLMs through jailbreaking attacks, comparatively little effort has gone into defending against them. We identify a pivotal factor behind the success of jailbreaks: the intrinsic conflict between the goal of being helpful and the goal of ensuring safety. Accordingly, we propose integrating goal prioritization at both the training and inference stages to counteract these attacks. Implementing goal prioritization during inference reduces the Attack Success Rate (ASR) of jailbreaking from 66.4% to 3.6% for ChatGPT, and integrating goal prioritization into model training reduces the ASR from 71.0% to 6.6% for Llama2-13B. Remarkably, even when no jailbreaking samples are included during training, our approach halves the ASR. Our findings also reveal that while stronger LLMs face greater safety risks, they also have a greater capacity to be steered towards defending against such attacks; both effects stem from their stronger instruction-following ability. Our work thus contributes to the understanding of jailbreaking attacks and defenses, and sheds light on the relationship between LLMs' capability and safety. Our code is available at \url{https://github.com/thu-coai/JailbreakDefense_GoalPriority}.
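To make the inference-stage idea and the ASR metric concrete, the following is a minimal sketch, not the authors' implementation. It assumes that goal prioritization at inference can be approximated by prepending an instruction that ranks safety above helpfulness, and that ASR is the percentage of jailbreak attempts judged successful. The instruction text and the names `PRIORITY_INSTRUCTION`, `build_prioritized_prompt`, and `attack_success_rate` are hypothetical; the actual prompts are in the linked repository.

```python
from typing import Sequence

# Hypothetical instruction text (an assumption for illustration only);
# it is not the prompt used in the paper.
PRIORITY_INSTRUCTION = (
    "You must prioritize safety over helpfulness. "
    "If answering the user query could cause harm, refuse politely; "
    "otherwise, answer as helpfully as you can."
)


def build_prioritized_prompt(user_query: str) -> str:
    """Wrap a user query with an explicit goal-priority instruction."""
    return f"{PRIORITY_INSTRUCTION}\n\n# User query\n{user_query}"


def attack_success_rate(attack_succeeded: Sequence[bool]) -> float:
    """Return the percentage of jailbreak attempts judged successful."""
    if not attack_succeeded:
        raise ValueError("need at least one judged attempt")
    return 100.0 * sum(attack_succeeded) / len(attack_succeeded)
```

In this sketch the wrapped prompt would be sent to the model in place of the raw query, and each response would be labeled by a separate safety judge (not shown) before computing the ASR over the resulting list of booleans.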