Recent advances in large language model assistants have made them indispensable, raising significant concerns over managing their safety. Automated red teaming offers a promising alternative to labor-intensive and error-prone manual probing for vulnerabilities, providing more consistent and scalable safety evaluations. However, existing approaches often compromise diversity by focusing on maximizing the attack success rate. Additionally, methods that reward semantic diversity by decreasing cosine similarity to historical embeddings suffer novelty stagnation as the history grows. To address these issues, we introduce DiveR-CT, which relaxes conventional constraints on the objective and the semantic reward, granting the policy greater freedom to enhance diversity. Our experiments demonstrate DiveR-CT's marked superiority over baselines: 1) it generates data that score better on various diversity metrics across different attack success rate levels; 2) it better enhances the resilience of blue-team models through safety tuning on the collected data; 3) it allows dynamic control of objective weights for reliable and controllable attack success rates; and 4) it reduces susceptibility to reward overoptimization. Overall, our method provides an effective and efficient approach to LLM red teaming, accelerating real-world deployment.
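The novelty-stagnation effect mentioned above can be illustrated with a minimal sketch. This is not DiveR-CT's actual reward or implementation; `novelty_reward` and the random embeddings are hypothetical stand-ins. The sketch defines a diversity reward as one minus the maximum cosine similarity to a growing history of past embeddings: as the history fills the embedding space, the maximum similarity to *some* past sample rises, so the achievable reward shrinks even for genuinely fresh samples.

```python
import math
import random

random.seed(0)

def cosine(u, v):
    # standard cosine similarity between two dense vectors
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def novelty_reward(emb, history):
    # hypothetical history-based diversity reward:
    # 1 - (max cosine similarity to any past embedding)
    if not history:
        return 1.0
    return 1.0 - max(cosine(emb, h) for h in history)

def random_embedding(dim=16):
    # stand-in for a sentence embedding of a generated attack prompt
    return [random.gauss(0.0, 1.0) for _ in range(dim)]

history, rewards = [], []
for step in range(500):
    e = random_embedding()
    rewards.append(novelty_reward(e, history))
    history.append(e)

early = sum(rewards[:50]) / 50    # mean reward while history is small
late = sum(rewards[-50:]) / 50    # mean reward once history is large
print(early > late)               # reward shrinks as history grows
```

Even though every sample here is drawn independently (maximally "novel" by construction), the attainable reward decays with history size, which is the stagnation DiveR-CT's relaxed semantic constraint is designed to avoid.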