Despite the rapid development of safety alignment techniques for LLMs, defending against multi-turn jailbreaks remains a challenging task. In this paper, we conduct a comprehensive comparison, revealing that some existing defense methods improve the robustness of LLMs against multi-turn jailbreaks but compromise usability, i.e., they reduce general capabilities or cause over-refusal. From the perspective of mechanistic interpretability of LLMs, we discover that these methods fail to establish a boundary that exactly distinguishes safe from harmful feature representations. Consequently, boundary-safe representations, i.e., safe representations that lie close to harmful ones, are inevitably disrupted, leading to a decline in usability. To address this issue, we propose X-Boundary, which pushes harmful representations away from boundary-safe representations to obtain an exact distinguishing boundary. In this way, harmful representations can be precisely erased without disrupting safe ones. Experimental results show that X-Boundary achieves state-of-the-art defense performance against multi-turn jailbreaks, while reducing the over-refusal rate by about 20% and almost fully preserving general capabilities. Furthermore, we theoretically prove and empirically verify that X-Boundary accelerates convergence during training. Our code is available at: https://github.com/AI45Lab/X-Boundary.
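For intuition, the core separation idea, pushing harmful hidden representations at least some distance away from boundary-safe ones, can be sketched as a margin-based loss. The following is a minimal sketch assuming PyTorch; the tensor names (`h_safe`, `h_harmful`), shapes, and margin value are illustrative assumptions, not the paper's exact formulation.

```python
# Minimal sketch of a margin-based separation loss, assuming PyTorch.
# `h_safe` / `h_harmful` are hypothetical names for hidden states of
# boundary-safe and harmful prompts; X-Boundary's actual objective may differ.
import torch
import torch.nn.functional as F

def boundary_separation_loss(h_safe: torch.Tensor,
                             h_harmful: torch.Tensor,
                             margin: float = 1.0) -> torch.Tensor:
    """Push harmful representations at least `margin` away from
    boundary-safe ones via a hinge on pairwise feature distances."""
    # Pairwise Euclidean distances between every safe/harmful pair:
    # shape (n_safe, n_harmful).
    dists = torch.cdist(h_safe, h_harmful, p=2)
    # Penalize pairs closer than the margin; loss vanishes once separated.
    return F.relu(margin - dists).mean()

# Usage example with random features of hidden size 4096:
h_safe = torch.randn(8, 4096)
h_harmful = torch.randn(8, 4096)
loss = boundary_separation_loss(h_safe, h_harmful)
```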