Red-teaming, or identifying prompts that elicit harmful responses, is a critical step in ensuring the safe and responsible deployment of large language models (LLMs). Developing effective protection against many modes of attack requires discovering diverse attack prompts. Automated red-teaming typically uses reinforcement learning to fine-tune an attacker language model to generate prompts that elicit undesirable responses from a target LLM, as measured, for example, by an auxiliary toxicity classifier. We show that even with explicit regularization to favor novelty and diversity, existing approaches suffer from mode collapse or fail to generate effective attacks. As a flexible and probabilistically principled alternative, we propose GFlowNet fine-tuning, followed by a secondary smoothing phase, to train the attacker model to generate diverse and effective attack prompts. We find that the attacks generated by our method are effective against a wide range of target LLMs, both with and without safety tuning, and transfer well between target LLMs. Finally, we demonstrate that models safety-tuned using a dataset of red-teaming prompts generated by our method are robust to attacks from other RL-based red-teaming approaches.