Security alignment enables Large Language Models (LLMs) to resist malicious queries, but a variety of jailbreak attacks expose the fragility of this safety mechanism. Previous studies have treated LLM jailbreak attacks and defenses in isolation. We analyze the security protection mechanism of LLMs and propose a framework that combines attack and defense. Our method builds on the linear separability of LLM intermediate-layer embeddings and on the essence of jailbreak attacks, which is to embed harmful queries and shift their representations into the safe region. We use a generative adversarial network (GAN) to learn the security judgment boundary inside the LLM, enabling both efficient jailbreak attacks and defenses. Experimental results show that our method achieves an average jailbreak success rate of 88.85\% across three popular LLMs, while the defense success rate on a state-of-the-art jailbreak dataset averages 84.17\%. This not only validates the effectiveness of our approach but also sheds light on the internal security mechanisms of LLMs, offering new insights for enhancing model security. The code and data are available at https://github.com/NLPGM/CAVGAN.
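To make the core idea concrete, the sketch below is a minimal, illustrative PyTorch example (not the released implementation) of a GAN-style setup over intermediate-layer embeddings: a linear probe plays the role of the model's internal safety boundary, and a generator learns a perturbation that pushes harmful embeddings across it. The embedding dimension, learning rates, and the random stand-in data are all assumptions for illustration.

```python
# Illustrative sketch only: a GAN over hypothetical LLM intermediate-layer
# embeddings. Dimensions, data, and hyperparameters are assumptions.
import torch
import torch.nn as nn

EMB_DIM = 4096  # hypothetical hidden size of the target LLM layer

class Discriminator(nn.Module):
    """Linear probe approximating the safe/harmful decision boundary."""
    def __init__(self, dim=EMB_DIM):
        super().__init__()
        self.probe = nn.Linear(dim, 1)

    def forward(self, h):
        return self.probe(h)  # logit > 0 -> judged "safe"

class Generator(nn.Module):
    """Produces a small additive perturbation for a harmful embedding."""
    def __init__(self, dim=EMB_DIM):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.Tanh())

    def forward(self, h):
        return h + 0.1 * self.net(h)  # perturbed embedding

# Stand-in data: random vectors in place of real safe/harmful embeddings.
safe_emb = torch.randn(64, EMB_DIM)
harm_emb = torch.randn(64, EMB_DIM) + 1.0

D, G = Discriminator(), Generator()
opt_d = torch.optim.Adam(D.parameters(), lr=1e-4)
opt_g = torch.optim.Adam(G.parameters(), lr=1e-4)
bce = nn.BCEWithLogitsLoss()

for step in range(100):
    # Discriminator: separate safe (label 1) from perturbed harmful (label 0).
    fake = G(harm_emb).detach()
    loss_d = bce(D(safe_emb), torch.ones(64, 1)) + bce(D(fake), torch.zeros(64, 1))
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()

    # Generator: push perturbed harmful embeddings to the "safe" side (the
    # attack view); the trained probe can also act as a defense-time filter.
    loss_g = bce(D(G(harm_emb)), torch.ones(64, 1))
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
```

Under these assumptions, the trained discriminator approximates the model's internal safety boundary, while the generator realizes the attack side of the framework; the same boundary estimate can be reused to flag boundary-crossing embeddings at inference time.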