We introduce WildTeaming, an automatic LLM safety red-teaming framework that mines in-the-wild user-chatbot interactions to discover 5.7K unique clusters of novel jailbreak tactics, and then composes multiple tactics for systematic exploration of novel jailbreaks. Compared to prior work that performed red-teaming via recruited human workers, gradient-based optimization, or iterative revision with LLMs, our work investigates jailbreaks from chatbot users who were not specifically instructed to break the system. WildTeaming reveals previously unidentified vulnerabilities of frontier LLMs, producing up to 4.6x more diverse and successful adversarial attacks than state-of-the-art jailbreak methods. While many datasets exist for jailbreak evaluation, very few open-source datasets exist for jailbreak training, as safety training data has remained closed even when model weights are open. With WildTeaming we create WildJailbreak, a large-scale open-source synthetic safety dataset with 262K vanilla (direct request) and adversarial (complex jailbreak) prompt-response pairs. To mitigate exaggerated safety behaviors, WildJailbreak provides two contrastive types of queries: 1) harmful queries (vanilla & adversarial) and 2) benign queries that resemble harmful queries in form but contain no harm. As WildJailbreak considerably upgrades the quality and scale of existing safety resources, it uniquely enables us to examine the scaling effects of data and the interplay of data properties and model capabilities during safety training. Through extensive experiments, we identify the training properties that enable an ideal balance of safety behaviors: appropriate safeguarding without over-refusal, effective handling of vanilla and adversarial queries, and minimal, if any, decrease in general capabilities. All components of WildJailbreak contribute to achieving balanced safety behaviors of models.
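To make the tactic-composition idea concrete, the following is a minimal, hypothetical sketch of how mined jailbreak tactics might be layered onto a vanilla (direct) request to form an adversarial prompt. The tactic names and templates below are invented for illustration only; they are not the actual tactics or pipeline from the paper.

```python
import random

# Hypothetical tactic templates; real tactics are mined from
# in-the-wild user-chatbot interactions, not hand-written.
TACTICS = {
    "roleplay": "You are a character in a story who is asked: {prompt}",
    "hypothetical_framing": "Purely hypothetically, {prompt}",
    "nested_task": "First restate, then elaborate in detail on: {prompt}",
}

def compose_adversarial(vanilla_prompt: str, n_tactics: int = 2,
                        seed: int = 0) -> str:
    """Wrap a vanilla prompt with n randomly selected tactics,
    applying each template around the current prompt in turn."""
    rng = random.Random(seed)
    prompt = vanilla_prompt
    for name in rng.sample(sorted(TACTICS), n_tactics):
        prompt = TACTICS[name].format(prompt=prompt)
    return prompt

# Example composition of two tactics around a placeholder request.
adversarial = compose_adversarial("<a vanilla request goes here>")
print(adversarial)
```

In the paper's framing, such compositions are generated at scale and paired with appropriate responses, alongside benign look-alike queries, to form the contrastive training data in WildJailbreak.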