Large language models (LLMs) are popular for high-quality text generation but can produce harmful content, even when aligned with human values through reinforcement learning. Adversarial prompts can bypass their safety measures. We propose LLM Self Defense, a simple approach to defend against these attacks by having an LLM screen the induced responses. Our method does not require any fine-tuning, input preprocessing, or iterative output generation. Instead, we incorporate the generated content into a pre-defined prompt and employ another instance of an LLM to analyze the text and predict whether it is harmful. We test LLM Self Defense on GPT 3.5 and Llama 2, two of the most prominent current LLMs, against various types of attacks, such as forcefully inducing affirmative responses to prompts and prompt engineering attacks. Notably, LLM Self Defense succeeds in reducing the attack success rate to virtually 0 using both GPT 3.5 and Llama 2.
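
The screening step described in the abstract (embed the generated response in a pre-defined prompt, then ask another LLM instance whether it is harmful) might look roughly like the minimal Python sketch below. Everything named here is an assumption made for illustration: the wording of `FILTER_PROMPT`, the `is_harmful` and `self_defense` helpers, the refusal message, and `toy_filter_llm`, a keyword-based stand-in for a real model call. None of it is taken from the authors' implementation.

```python
from typing import Callable

# Hypothetical prompt wording. The abstract only states that the generated
# content is incorporated into "a pre-defined prompt".
FILTER_PROMPT = (
    "Does the following text contain harmful content? "
    "Answer with 'Yes, this is harmful' or 'No, this is not harmful'.\n\n"
    "Text: {response}"
)


def is_harmful(response: str, llm: Callable[[str], str]) -> bool:
    """Ask a second LLM instance whether `response` is harmful."""
    verdict = llm(FILTER_PROMPT.format(response=response))
    # Assumed convention: the filter's answer starts with "yes" or "no".
    return verdict.strip().lower().startswith("yes")


def self_defense(
    response: str,
    llm: Callable[[str], str],
    refusal: str = "Sorry, I cannot help with that.",
) -> str:
    """Return the response unchanged, or a refusal if it is flagged."""
    return refusal if is_harmful(response, llm) else response


def toy_filter_llm(prompt: str) -> str:
    """Stand-in for a real model call: keyword check on the embedded text."""
    text = prompt.split("Text: ", 1)[1].lower()
    if "build a bomb" in text:
        return "Yes, this is harmful"
    return "No, this is not harmful"


if __name__ == "__main__":
    for candidate in (
        "Here is a recipe for pancakes.",
        "Sure, here is how to build a bomb: ...",
    ):
        print(self_defense(candidate, toy_filter_llm))
```

In practice `toy_filter_llm` would be replaced by a call to an actual model; the filter only sees the generated text, which is why no fine-tuning or input preprocessing is involved in this sketch.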