Aligning large language models (LLMs) with human values, particularly in the face of stealthy and complex jailbreaks, presents a formidable challenge. In this study, we present a simple yet highly effective defense strategy, i.e., Intention Analysis Prompting (IAPrompt). The principle behind is to trigger LLMs' inherent self-correct and improve ability through a two-stage process: 1) essential intention analysis, and 2) policy-aligned response. Notably, IAPrompt is an inference-only method, thus could enhance the safety of LLMs without compromising their helpfulness. Extensive experiments on SAP200 and DAN benchmarks across Vicuna, ChatGLM, MPT, DeepSeek, and GPT-3.5 show that IAPrompt could consistently and significantly reduce the harmfulness in response (averagely -46.5% attack success rate) and maintain the general helpfulness. Further analyses present some insights into how our method works. To facilitate reproducibility, We release our code and scripts at: https://github.com/alphadl/SafeLLM_with_IntentionAnalysis
翻译:将大型语言模型(LLMs)与人类价值观对齐,特别是在面对隐蔽且复杂的越狱攻击时,是一项艰巨的挑战。在本研究中,我们提出了一种简单且高效的防御策略,即意图分析提示(IAPrompt)。其核心原理是通过两阶段过程触发LLMs固有的自我修正与改进能力:1)基本意图分析;2)符合策略的回应。值得注意的是,IAPrompt是一种仅需推理的方法,因此能在不损害LLMs有用性的前提下增强其安全性。在SAP200和DAN基准测试上,针对Vicuna、ChatGLM、MPT、DeepSeek和GPT-3.5的广泛实验表明,IAPrompt能够持续且显著地降低响应的危害性(平均攻击成功率降低46.5%),同时保持通用有用性。进一步分析揭示了该方法的工作机理。为促进可复现性,我们已在https://github.com/alphadl/SafeLLM_with_IntentionAnalysis 开源了代码与脚本。