Although Large Language Models (LLMs) have achieved tremendous success in various applications, they are also susceptible to certain prompts that induce them to bypass built-in safety measures and produce dangerous or illegal content, a phenomenon known as jailbreaking. To protect LLMs from producing harmful information, various defense strategies have been proposed, most of which focus on content filtering or adversarial training of the model. In this paper, we propose an approach named Prompt Adversarial Tuning (PAT), which trains a defense control that is then attached as a prefix to user prompts to implement our defense strategy. To reach our optimization goal, we design a training process similar to adversarial training that alternates between updating the attack control and the defense control. To our knowledge, we are the first to approach defense from the perspective of prompt tuning. Once deployed, our method has almost no impact on the operational efficiency of LLMs. Experiments show that our method is effective in both black-box and white-box settings, reducing the success rate of advanced attacks to nearly 0 while maintaining a benign answer rate of 80% on simple benign questions. Our work may chart a new direction for future explorations in LLM security.
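The alternating update described above can be illustrated with a toy sketch. This is not the authors' implementation: the greedy token substitution stands in for whatever optimizer PAT actually uses, and the names `apply_defense`, `best_substitution`, `alternate_tuning`, `attack_loss`, and `defense_loss` are hypothetical. In a real setting, the two loss functions would be computed from the model's outputs; here they are arbitrary callables supplied by the caller.

```python
from typing import Callable, List, Sequence, Tuple

Tokens = List[str]
Loss = Callable[[Tokens, Tokens], float]  # called as loss(attack, defense)


def apply_defense(defense: Sequence[str], user_prompt: str) -> str:
    """Prepend the defense control to a user prompt (the deployment-time step)."""
    return " ".join(defense) + " " + user_prompt


def best_substitution(
    tokens: Tokens,
    position: int,
    vocab: Sequence[str],
    loss_of: Callable[[Tokens], float],
) -> Tokens:
    """Try every vocabulary token at one position and keep the lowest-loss variant."""
    candidates = [tokens[:position] + [tok] + tokens[position + 1:] for tok in vocab]
    return min(candidates, key=loss_of)


def alternate_tuning(
    attack: Sequence[str],
    defense: Sequence[str],
    attack_loss: Loss,
    defense_loss: Loss,
    vocab: Sequence[str],
    steps: int,
) -> Tuple[Tokens, Tokens]:
    """Alternate one attack-control update with one defense-control update per step.

    Assumption: a greedy single-token substitution replaces the real optimizer.
    """
    attack, defense = list(attack), list(defense)
    for step in range(steps):
        # Attack update: the defense control is held fixed.
        pos_a = step % len(attack)
        attack = best_substitution(
            attack, pos_a, vocab, lambda a: attack_loss(a, defense)
        )
        # Defense update: the freshly updated attack control is held fixed.
        pos_d = step % len(defense)
        defense = best_substitution(
            defense, pos_d, vocab, lambda d: defense_loss(attack, d)
        )
    return attack, defense
```

Once tuning is finished, only `apply_defense` runs at inference time, which is consistent with the claim that the defense adds little operational overhead: the cost is a fixed-length prefix on each prompt.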