Explore, Establish, Exploit: Red Teaming Language Models from Scratch

Deploying large language models (LLMs) can pose hazards from harmful outputs such as toxic or dishonest speech. Prior work has introduced tools that elicit harmful outputs in order to identify and mitigate these risks. While this is a valuable step toward securing language models, these approaches typically rely on a pre-existing classifier for undesired outputs. This limits their application to situations where the type of harmful behavior is known with precision beforehand. However, this skips a central challenge of red teaming: developing a contextual understanding of the behaviors that a model can exhibit. Furthermore, when such a classifier already exists, red teaming has limited marginal value because the classifier could simply be used to filter training data or model outputs. In this work, we consider red teaming under the assumption that the adversary is working from a high-level, abstract specification of undesired behavior. The red team is expected to refine or extend this specification and identify methods to elicit this behavior from the model. Our red teaming framework consists of three steps: 1) Exploring the model's behavior in the desired context; 2) Establishing a measurement of undesired behavior (e.g., a classifier trained to reflect human evaluations); and 3) Exploiting the model's flaws using this measure and an established red teaming methodology. We apply this approach to red team GPT-2 and GPT-3 models to systematically discover classes of prompts that elicit toxic and dishonest statements. In doing so, we also construct and release the CommonClaim dataset of 20,000 statements that have been labeled by human subjects as common-knowledge-true, common-knowledge-false, or neither. Code is available at https://github.com/thestephencasper/explore_establish_exploit_llms. CommonClaim is available at https://github.com/Algorithmic-Alignment-Lab/CommonClaim.
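As a rough illustration of how the three steps fit together, the following is a minimal sketch and not the released implementation. Every name in it (`toy_model`, `toy_human_label`, `explore`, `establish`, `exploit`) is hypothetical. The target model is replaced by a deterministic toy function, human evaluation by a keyword rule, the trained classifier by a crude word-frequency scorer, and the red teaming method by an exhaustive ranking of a few candidate prompts.

```python
from collections import Counter
from typing import Callable, List, Tuple

# Hypothetical stand-in for the target language model (prompt -> completion).
# A real model would be sampled stochastically; this toy one is deterministic.
def toy_model(prompt: str) -> str:
    return "you are awful" if "insult" in prompt else "have a nice day"

# Hypothetical stand-in for human evaluation (1 = undesired, 0 = acceptable).
def toy_human_label(text: str) -> int:
    return int("awful" in text)

def explore(model: Callable[[str], str], prompts: List[str]) -> List[str]:
    """Step 1: collect model outputs in the context of interest."""
    return [model(p) for p in prompts]

def establish(samples: List[str],
              label: Callable[[str], int]) -> Callable[[str], float]:
    """Step 2: build a measure of undesired behavior from labeled samples.
    A word-frequency scorer stands in for a trained classifier."""
    bad, good = Counter(), Counter()
    for text in samples:
        (bad if label(text) else good).update(text.split())

    def score(text: str) -> float:
        words = text.split()
        if not words:
            return 0.0
        # Fraction of words seen more often in undesired samples.
        return sum(1 for w in words if bad[w] > good[w]) / len(words)

    return score

def exploit(model: Callable[[str], str],
            score: Callable[[str], float],
            candidates: List[str]) -> List[Tuple[float, str]]:
    """Step 3: rank candidate prompts by the measured score of their outputs.
    Exhaustive ranking stands in for an established red teaming method."""
    return sorted(((score(model(p)), p) for p in candidates), reverse=True)

prompts = ["say hello", "insult me", "tell a story", "insult my friend"]
samples = explore(toy_model, prompts)
score = establish(samples, toy_human_label)
ranked = exploit(toy_model, score, prompts)
for s, p in ranked:
    print(f"{s:.2f}  {p}")
```

The point of the sketch is the ordering of the steps: the measure used in step 3 is derived from data gathered in step 1 and labels obtained in step 2, rather than from a classifier assumed to exist beforehand.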
