The Attacker in the Mirror: Breaking Self-Consistency in Safety via Anchored Bipolicy Self-Play

Self-play red teaming is an established approach to improving AI safety in which different instances of the same model play attacker and defender roles in a zero-sum game, where the attacker tries to jailbreak the defender. If self-play converges to a Nash equilibrium, the model is guaranteed to respond safely within the settings of the game. Although the parameter sharing that results from using the same model for both roles improves stability and performance, it introduces fundamental theoretical and architectural limitations. We show that the set of reachable Nash equilibria corresponds to a broad class of behaviours that includes trivial always-refuse strategies and oracle-like defenders, which limits practical applicability. We then show that when attacker and defender share and update the same base model, the dynamics collapse to self-consistency, so that attacks no longer exert adversarial pressure on the defender. In response, we propose Anchored Bipolicy Self-Play, which trains distinct role-specific LoRA adapters on top of a frozen base model, thereby maintaining stable optimisation while preserving adversarial pressure through explicit role separation. Compared with standard self-play, our method is up to 100x more parameter-efficient than fine-tuning and yields consistent safety improvements over self-play fine-tuned models. We evaluate on Qwen2.5-{3B, 7B, 14B}-IT models across widely used safety benchmarks, showing improved robustness without loss of reasoning ability. Cross-play experiments further show that our attacker and defender models are superior to their self-play counterparts in terms of adversarial defence and safety.
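To make the role separation concrete, the following is a minimal toy sketch, not the paper's implementation. It assumes a single linear layer in NumPy standing in for the frozen base model, with one low-rank adapter per role. The names `W_base`, `adapters`, `forward`, `sgd_step`, and `trainable_fraction`, as well as the sizes `d` and `r`, are hypothetical and chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 64, 4  # hidden size and LoRA rank (illustrative values, not from the paper)

# Frozen base weight shared by both roles; marked read-only so it cannot be updated.
W_base = rng.standard_normal((d, d))
W_base.setflags(write=False)

def new_adapter():
    # Common LoRA initialisation: A small random, B zero, so the adapter starts as a no-op.
    return {"A": rng.standard_normal((r, d)) * 0.01, "B": np.zeros((d, r))}

# One adapter per role; no trainable parameters are shared between roles.
adapters = {"attacker": new_adapter(), "defender": new_adapter()}

def forward(x, role):
    # Effective weight for a role is the frozen base plus that role's low-rank update B @ A.
    ad = adapters[role]
    return x @ (W_base + ad["B"] @ ad["A"]).T

def sgd_step(role, grad_A, grad_B, lr=0.1):
    # Only the named role's adapter changes; the base and the other role are untouched.
    ad = adapters[role]
    ad["A"] -= lr * grad_A
    ad["B"] -= lr * grad_B

def trainable_fraction():
    # Per-role trainable parameters (A: r x d, B: d x r) relative to the d x d base layer.
    return 2 * d * r / (d * d)
```

In this toy setting, an update to the attacker's adapter leaves the defender's outputs unchanged, which is the property that explicit role separation is meant to provide. The trainable fraction here depends only on the illustrative `d` and `r` and says nothing about the efficiency figures reported for the actual models.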
