While Large Language Models (LLMs) have achieved tremendous success in various applications, they remain susceptible to jailbreak attacks. Several defense strategies have been proposed to keep LLMs from producing harmful information, most of which focus on harmful content filtering or heuristic defensive prompt design. However, how to achieve intrinsic robustness through prompts remains an open problem. In this paper, motivated by adversarial training paradigms for achieving reliable robustness, we propose Prompt Adversarial Tuning (PAT), an approach that trains a control prompt attached to the user prompt as a guard prefix. To achieve our defense goal while maintaining natural performance, we optimize the control prompt with both adversarial and benign prompts. Comprehensive experiments show that our method is effective against both black-box and white-box attacks, reducing the success rate of advanced attacks to nearly 0% while maintaining the model's utility on benign tasks. The proposed defense incurs only negligible computational overhead, charting a new perspective for future explorations in LLM security. Our code is available at https://github.com/rain152/PAT.
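To make the two ideas in the abstract concrete (a guard prefix placed before the user prompt, and an objective that mixes adversarial and benign prompts), the following is a minimal Python sketch. It is not the authors' implementation: the names `GuardedPrompt`, `combined_objective`, `toy_loss`, the newline separator, and the weighting parameter `alpha` are all assumptions made for illustration, and the actual search over prefix tokens is not shown.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

# Hypothetical loss signature: (rendered prompt, is_adversarial) -> loss value.
# With a real model this might score a refusal on adversarial prompts and a
# helpful answer on benign ones; that choice is an assumption here.
LossFn = Callable[[str, bool], float]


@dataclass(frozen=True)
class GuardedPrompt:
    guard_prefix: str
    user_prompt: str

    def render(self) -> str:
        # The guard prefix goes before the user's text (separator is assumed).
        return f"{self.guard_prefix}\n{self.user_prompt}"


def mean_loss(loss_fn: LossFn, guard_prefix: str,
              prompts: Sequence[str], is_adversarial: bool) -> float:
    if not prompts:
        raise ValueError("prompts must not be empty")
    total = sum(
        loss_fn(GuardedPrompt(guard_prefix, p).render(), is_adversarial)
        for p in prompts
    )
    return total / len(prompts)


def combined_objective(loss_fn: LossFn, guard_prefix: str,
                       adversarial_prompts: Sequence[str],
                       benign_prompts: Sequence[str],
                       alpha: float = 0.5) -> float:
    # alpha (assumed) trades off the defense term against the utility term.
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    defense = mean_loss(loss_fn, guard_prefix, adversarial_prompts, True)
    utility = mean_loss(loss_fn, guard_prefix, benign_prompts, False)
    return alpha * defense + (1.0 - alpha) * utility


def toy_loss(rendered_prompt: str, is_adversarial: bool) -> float:
    # Stand-in for a model-based loss, used only to make the sketch runnable.
    if is_adversarial:
        return 0.0 if rendered_prompt.startswith("[GUARD]") else 1.0
    return 0.25
```

In this sketch, a candidate prefix would be judged better when `combined_objective` is lower, so a prefix that only blocks attacks but hurts benign prompts is penalized through the second term.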