Large Language Models (LLMs) have achieved remarkable capabilities but remain vulnerable to adversarial ``jailbreak'' attacks designed to bypass safety guardrails. Current safety alignment methods depend heavily on static external red teaming, relying on fixed defense prompts or pre-collected adversarial datasets. The resulting defenses are rigid: they overfit known attack patterns and fail to generalize to novel, sophisticated threats. To address this critical limitation, we propose empowering the model to be its own red teamer, capable of mounting autonomous, continually evolving adversarial attacks. Specifically, we introduce Safety Self-Play (SSP), a system in which a single LLM acts concurrently as both the Attacker (generating jailbreaks) and the Defender (refusing harmful requests) within a unified Reinforcement Learning (RL) loop, dynamically evolving attack strategies to uncover vulnerabilities while simultaneously strengthening defense mechanisms. To ensure the Defender addresses critical safety failures during self-play, we introduce a Reflective Experience Replay Mechanism built on an experience pool accumulated throughout training. The mechanism employs an Upper Confidence Bound (UCB) sampling strategy that prioritizes failure cases with low rewards, helping the model learn from hard past mistakes while balancing exploration and exploitation. Extensive experiments demonstrate that SSP autonomously evolves robust defense capabilities, significantly outperforming baselines trained on static adversarial datasets and establishing a new benchmark for proactive safety alignment.
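To make the replay mechanism concrete, the following is a minimal sketch of a UCB-style sampler over an experience pool. The class name `UCBReplayPool`, the scoring formula (an exploitation term favoring low-reward failures plus a standard UCB exploration bonus), and the coefficient `c` are illustrative assumptions, not the paper's actual implementation.

```python
import math


class UCBReplayPool:
    """Hypothetical sketch of UCB sampling over an experience pool.

    Each entry stores an experience and the reward the Defender earned
    on it. Low-reward (failure) cases get a large exploitation term,
    while rarely replayed cases get a large exploration bonus, so the
    sampler balances revisiting hard mistakes against covering the pool.
    """

    def __init__(self, c: float = 1.0):
        self.c = c          # exploration coefficient (assumed hyperparameter)
        self.entries = []   # list of (experience, reward) pairs
        self.visits = []    # how often each entry has been replayed
        self.total = 0      # total number of draws so far

    def add(self, experience, reward: float) -> None:
        self.entries.append((experience, reward))
        self.visits.append(0)

    def sample(self):
        """Return the experience with the highest UCB score."""
        self.total += 1
        best_i, best_score = 0, float("-inf")
        for i, (_, reward) in enumerate(self.entries):
            # Exploitation: low reward => hard failure => high priority.
            exploit = 1.0 - reward
            # Exploration: standard UCB bonus for rarely visited entries.
            explore = self.c * math.sqrt(
                math.log(self.total) / (self.visits[i] + 1)
            )
            score = exploit + explore
            if score > best_score:
                best_i, best_score = i, score
        self.visits[best_i] += 1
        return self.entries[best_i][0]
```

With fresh (unvisited) entries the exploration bonuses are equal, so the first draw simply picks the lowest-reward failure case; as that case is replayed repeatedly, its shrinking exploration bonus lets other entries surface.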