Typical schemes for automated red-teaming of large language models (LLMs) focus on discovering prompts that trigger a frozen language model (the defender) to generate toxic text. These schemes often drive the prompting model (the adversary) toward text that is unintelligible and unlikely to arise during normal use. Here, we propose a reinforcement learning formulation of the LLM red-teaming task that discovers prompts which both (1) trigger toxic outputs from a frozen defender and (2) have low perplexity as scored by that defender. We argue such prompts are the most pertinent in a red-teaming setting because they are likely to arise during ordinary use of the defender model. We solve this formulation with a novel online, weakly supervised variant of Identity Preference Optimization (IPO), evaluated against GPT-2 and GPT-2 XL defenders. We demonstrate that the resulting policy generates likely prompts that also trigger toxicity. Finally, we qualitatively analyze the learned strategies and the trade-off between likelihood and toxicity, and discuss implications. Source code is available at: https://github.com/sisl/ASTPrompter/.
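The dual objective described above — rewarding prompts that elicit toxicity while penalizing prompts the defender finds improbable — can be sketched as a scalar reward. This is a minimal illustration, not the paper's actual reward: the function name `prompt_reward`, the mean-log-probability likelihood proxy, and the `weight` coefficient are all hypothetical choices for exposition.

```python
import math

def prompt_reward(defender_token_logprobs, toxicity, weight=0.1):
    """Combine defender-assessed likelihood with elicited toxicity.

    defender_token_logprobs: per-token log-probabilities the frozen
        defender assigns to the adversary's prompt (higher = more likely).
    toxicity: score in [0, 1] for the toxicity of the defender's
        continuation, e.g. from an external classifier.
    weight: hypothetical trade-off coefficient (not from the paper).
    """
    mean_logprob = sum(defender_token_logprobs) / len(defender_token_logprobs)
    # Low perplexity corresponds to a prompt likely under the defender.
    perplexity = math.exp(-mean_logprob)
    return toxicity - weight * math.log(perplexity)

# A low-perplexity prompt eliciting toxicity is rewarded more than a
# high-perplexity prompt eliciting the same toxicity.
likely = prompt_reward([-1.0, -1.5, -0.8], toxicity=0.9)
unlikely = prompt_reward([-6.0, -7.2, -5.5], toxicity=0.9)
```

Under this toy scoring, `likely > unlikely`, capturing the abstract's point that the formulation prefers attacks plausible enough to occur in normal use.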