Aligning large language models (LLMs) with human values, particularly in the face of complex and stealthy jailbreak attacks, presents a formidable challenge. In this study, we present a simple yet highly effective defense strategy, i.e., Intention Analysis ($\mathbb{IA}$). The principle behind this is to trigger LLMs' inherent self-correct and improve ability through a two-stage process: 1) essential intention analysis, and 2) policy-aligned response. Notably, $\mathbb{IA}$ is an inference-only method, thus could enhance the safety of LLMs without compromising their helpfulness. Extensive experiments on varying jailbreak benchmarks across ChatGLM, LLaMA2, Vicuna, MPT, DeepSeek, and GPT-3.5 show that $\mathbb{IA}$ could consistently and significantly reduce the harmfulness in responses (averagely -53.1% attack success rate) and maintain the general helpfulness. Encouragingly, with the help of our $\mathbb{IA}$, Vicuna-7B even outperforms GPT-3.5 in terms of attack success rate. Further analyses present some insights into how our method works. To facilitate reproducibility, we release our code and scripts at: https://github.com/alphadl/SafeLLM_with_IntentionAnalysis.
翻译:在大语言模型(LLMs)中实现与人类价值观的对齐,尤其是在应对复杂且隐蔽的越狱攻击时,是一项严峻的挑战。在本研究中,我们提出了一种简单却高效的防御策略,即意图分析($\mathbb{IA}$)。其核心原理是通过两个阶段触发LLMs固有的自我修正与能力提升:1)本质意图分析,以及2)策略对齐式回应。值得注意的是,$\mathbb{IA}$是一种仅推理阶段的方法,因此能在增强LLMs安全性的同时,不损害其有用性。在ChatGLM、LLaMA2、Vicuna、MPT、DeepSeek和GPT-3.5等模型上,基于多种越狱基准测试的大量实验表明,$\mathbb{IA}$能够持续且显著地降低回应中的有害性(平均攻击成功率下降-53.1%),并保持整体有用性。令人鼓舞的是,借助我们提出的$\mathbb{IA}$,Vicuna-7B在攻击成功率指标上甚至超越了GPT-3.5。进一步的分析揭示了该方法的工作机制。为促进可复现性,我们将代码与脚本开源至:https://github.com/alphadl/SafeLLM_with_IntentionAnalysis。