While Large Language Models (LLMs) have achieved tremendous success in various applications, they are also susceptible to jailbreaking attacks. Several primary defense strategies have been proposed to protect LLMs from producing harmful information, mostly focusing on model fine-tuning or heuristic defense designs. However, how to achieve intrinsic robustness through prompt optimization remains an open problem. In this paper, motivated by adversarial training paradigms for achieving reliable robustness, we propose an approach named Prompt Adversarial Tuning (PAT) that trains a prompt control attached to the user prompt as a guard prefix. To achieve our defense goal while maintaining natural performance, we optimize the control prompt with both adversarial and benign prompts. Comprehensive experiments show that our method is effective against both grey-box and black-box attacks, reducing the success rate of advanced attacks to nearly 0%, while maintaining the model's utility on benign tasks and incurring only negligible computational overhead, charting a new perspective for future explorations in LLM security. Our code is available at https://github.com/PKU-ML/PAT.