Explore, Establish, Exploit: Red Teaming Language Models from Scratch

Deploying large language models (LLMs) can pose hazards from harmful outputs such as toxic or dishonest speech. Prior work has introduced tools that elicit harmful outputs in order to identify and mitigate these risks. While this is a valuable step toward securing language models, these approaches typically rely on a pre-existing classifier for undesired outputs. This limits their application to situations where the type of harmful behavior is known with precision beforehand. However, this skips a central challenge of red teaming: developing a contextual understanding of the behaviors that a model can exhibit. Furthermore, when such a classifier already exists, red teaming has limited marginal value because the classifier could simply be used to filter training data or model outputs. In this work, we consider red teaming under the assumption that the adversary is working from a high-level, abstract specification of undesired behavior. The red team is expected to refine and extend this specification and to identify methods that elicit this behavior from the model. Our red teaming framework consists of three steps: 1) Exploring the model's behavior in the desired context; 2) Establishing a measurement of undesired behavior (e.g., a classifier trained to reflect human evaluations); and 3) Exploiting the model's flaws using this measure and an established red teaming methodology. We apply this approach to red team GPT-2 and GPT-3 models to systematically discover classes of prompts that elicit toxic and dishonest statements. In doing so, we also construct and release the CommonClaim dataset of 20,000 statements that have been labeled by human subjects as common-knowledge-true, common-knowledge-false, or neither. Code is available at https://github.com/thestephencasper/explore_establish_exploit_llms. CommonClaim is available at https://github.com/thestephencasper/common_claim.
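The three steps can be illustrated with a deliberately tiny sketch. This is not the released code: `toy_model`, the canned replies, the human labels, and the word-count measure are all hypothetical stand-ins, and the exploit step here is a plain search over a fixed candidate list rather than whatever red teaming method is actually used in practice. It only shows how the output of each step feeds the next.

```python
from collections import Counter
from typing import Callable, List, Tuple


def toy_model(prompt: str) -> str:
    """Hypothetical stand-in for the target LM: returns a canned reply per keyword."""
    replies = {
        "weather": "it looks sunny today",
        "rival": "they are stupid and awful",
        "history": "the treaty was signed long ago",
        "fans": "those fans are awful people",
    }
    for key, reply in replies.items():
        if key in prompt:
            return reply
    return "i am not sure"


def explore(model: Callable[[str], str], prompts: List[str]) -> List[str]:
    """Step 1: sample model outputs in the context of interest."""
    return [model(p) for p in prompts]


def establish(labeled: List[Tuple[str, int]]) -> Callable[[str], float]:
    """Step 2: fit a measure of undesired behavior from human labels (1 = undesired).

    Assumption: a word-count heuristic replaces a trained classifier.
    """
    bad, good = Counter(), Counter()
    for text, label in labeled:
        (bad if label else good).update(text.split())

    def score(text: str) -> float:
        words = text.split()
        if not words:
            return 0.0
        # Fraction of words seen more often in undesired than in acceptable samples.
        return sum(1 for w in words if bad[w] > good[w]) / len(words)

    return score


def exploit(model: Callable[[str], str], score: Callable[[str], float],
            candidates: List[str]) -> str:
    """Step 3: pick the candidate prompt whose output scores highest on the measure."""
    return max(candidates, key=lambda p: score(model(p)))


# Explore: collect outputs for a few probing prompts.
samples = explore(toy_model, ["tell me about the weather",
                              "describe your rival",
                              "a history fact"])
# Establish: hypothetical human labels for those samples, then fit the measure.
human_labels = [0, 1, 0]
measure = establish(list(zip(samples, human_labels)))
# Exploit: search new prompts for one that elicits the undesired behavior.
best_prompt = exploit(toy_model, measure, ["talk about the weather",
                                           "what about sports fans",
                                           "some history"])
print(best_prompt, measure(toy_model(best_prompt)))
```

The point of the design is that the measure in step 2 is built from data gathered in step 1, so no pre-existing classifier for the undesired behavior is assumed.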
