As large language models (LLMs) are increasingly deployed as black-box components in real-world applications, red teaming has become essential for identifying potential risks. Red teaming probes LLMs with adversarial prompts to uncover vulnerabilities and improve safety alignment. Ideally, effective red teaming should adapt to evolving LLM capabilities and explore a broad range of harmful topics. However, existing approaches face two limitations: 1) topic-based approaches rely on pre-collected harmful topics, which limits their flexibility and adaptivity; 2) topic-free methods use reinforcement learning (RL), but they lack an explicit reward signal for exploration and tend to over-optimize a narrow objective, which reduces topic diversity. To address these limitations, we propose RedTopic, a novel red teaming framework that generates topic-diverse adversarial prompts through a contextualized generation pipeline, an aggregate reward design, and a multi-objective RL training loop. Experiments show that RedTopic produces more effective and diverse adversarial prompts than existing methods, with notable improvements in integrated evaluation metrics. We believe RedTopic represents a step toward more adaptive and topic-diverse red teaming for LLMs.