Recent advances in large language models (LLMs) have made them indispensable, raising significant concerns about managing their safety. Automated red teaming offers a promising alternative to labor-intensive and error-prone manual probing for vulnerabilities, providing more consistent and scalable safety evaluations. However, existing approaches often compromise diversity by focusing on maximizing attack success rate. Additionally, methods that reward semantic diversity by decreasing the cosine similarity to historical embeddings suffer from novelty stagnation as the history grows. To address these issues, we introduce DiveR-CT, which relaxes conventional constraints on the objective and semantic reward, granting the policy greater freedom to enhance diversity. Our experiments demonstrate DiveR-CT's marked superiority over baselines by 1) generating data that score higher on various diversity metrics across different attack success rate levels, 2) better enhancing the resiliency of blue team models through safety tuning on the collected data, 3) allowing dynamic control of objective weights for reliable and controllable attack success rates, and 4) reducing susceptibility to reward overoptimization. Project details and code can be found at https://andrewzh112.github.io/#diverct.
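To make the novelty-stagnation claim concrete, below is a minimal, hypothetical Python sketch of the kind of semantic diversity reward the abstract critiques: one minus the maximum cosine similarity between a new prompt embedding and all historical embeddings. The function name `semantic_diversity_reward` and the max-similarity formulation are illustrative assumptions, not the exact reward used by DiveR-CT or any specific baseline; the toy loop only shows the qualitative effect.

```python
import numpy as np

rng = np.random.default_rng(0)

def semantic_diversity_reward(new_emb, history):
    """Score novelty as 1 - max cosine similarity to all past embeddings.

    Hypothetical reward for illustration: high when the new embedding points
    in a direction unlike anything in the history, low for near-duplicates.
    """
    if len(history) == 0:
        return 1.0
    # Normalize so dot products equal cosine similarities.
    new_unit = new_emb / np.linalg.norm(new_emb)
    hist = np.stack(history)
    hist_unit = hist / np.linalg.norm(hist, axis=1, keepdims=True)
    return 1.0 - float((hist_unit @ new_unit).max())

# Toy illustration of novelty stagnation: as random embeddings fill the
# space, any new embedding lands close to something old, so the achievable
# reward shrinks regardless of how the policy behaves.
dim, history = 16, []
for size in (10, 100, 1000):
    while len(history) < size:
        history.append(rng.standard_normal(dim))
    probe = rng.standard_normal(dim)
    print(f"history={size:4d}  reward={semantic_diversity_reward(probe, history):.3f}")
```

Under these assumptions, the printed reward decreases monotonically in expectation as the history grows, which is the stagnation effect that DiveR-CT's relaxed objective and semantic-reward constraints are designed to avoid.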