Large language models (LLMs) are popular for high-quality text generation but can produce harmful content, even when aligned with human values through reinforcement learning. Adversarial prompts can bypass their safety measures. We propose LLM Self Defense, a simple approach to defend against these attacks by having an LLM screen the induced responses. Our method requires no fine-tuning, input preprocessing, or iterative output generation. Instead, we incorporate the generated content into a pre-defined prompt and employ another instance of an LLM to analyze the text and predict whether it is harmful. We test LLM Self Defense on GPT 3.5 and Llama 2, two of the most prominent current LLMs, against various attack types, such as forcefully inducing affirmative responses and prompt engineering attacks. Notably, LLM Self Defense reduces the attack success rate to virtually 0% with both GPT 3.5 and Llama 2. The code is publicly available at https://github.com/poloclub/llm-self-defense.
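The screening step described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation (which is at the repository linked above): the exact screening-prompt wording, the function names (`is_harmful`, `guarded_generate`), and the refusal message are assumptions, and `call_llm` / `generate` are stand-ins for any chat-completion API such as GPT 3.5 or Llama 2.

```python
# Hypothetical sketch of LLM Self Defense: embed the candidate response in a
# fixed screening prompt and let a second LLM instance classify it as harmful.
# `call_llm` and `generate` are placeholders for real model API calls.

SCREEN_PROMPT = (
    "Does the following text contain harmful content? "
    "Answer starting with 'Yes' or 'No'.\n\n"
    "Text: {response}"
)


def is_harmful(response: str, call_llm) -> bool:
    """Ask a screening LLM whether `response` is harmful."""
    verdict = call_llm(SCREEN_PROMPT.format(response=response))
    # Treat any answer beginning with "yes" as a harmful verdict.
    return verdict.strip().lower().startswith("yes")


def guarded_generate(prompt: str, generate, call_llm) -> str:
    """Generate a reply, then suppress it if the screener flags it."""
    candidate = generate(prompt)
    if is_harmful(candidate, call_llm):
        return "Sorry, I can't help with that."
    return candidate
```

Because the defense only wraps the generated text in a new prompt, it needs no access to model weights or gradients, which is why it works unchanged across different LLMs.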