Ensuring robust safety alignment is crucial for Large Language Models (LLMs), yet existing defenses often lag behind evolving adversarial attacks due to their \textbf{reliance on static, pre-collected data distributions}. In this paper, we introduce \textbf{MAGIC}, a novel multi-turn, multi-agent reinforcement learning framework that formulates LLM safety alignment as an asymmetric adversarial game. Specifically, an attacker agent learns to iteratively rewrite original queries into deceptive prompts, while a defender agent simultaneously optimizes its policy to recognize and refuse such inputs. This dynamic process drives a \textbf{co-evolution} in which the attacker's ever-changing strategies continuously uncover long-tail vulnerabilities, pushing the defender to generalize to unseen attack patterns. Remarkably, we observe that the attacker, endowed with initial reasoning ability, evolves \textbf{novel, previously unseen combinatorial strategies} through iterative RL training, underscoring our method's substantial potential. Theoretically, we provide insights into a more robust game equilibrium and derive safety guarantees. Extensive experiments validate our framework's effectiveness, demonstrating superior defense success rates without compromising model helpfulness. Our code is available at \url{https://github.com/BattleWen/MAGIC}.
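The abstract describes the attacker-defender co-training loop only at a high level. Below is a minimal, self-contained toy sketch of that dynamic, assuming an epsilon-greedy bandit as a stand-in for the attacker's rewriting policy and a single scalar refusal rate as a stand-in for the defender; every name, strategy label, and constant here is a hypothetical illustration, not the paper's actual multi-turn RL pipeline.

\begin{verbatim}
import random

# Hypothetical rewriting strategies the attacker can choose among.
STRATEGIES = ["roleplay framing", "hypothetical framing", "payload splitting"]

def attacker_pick(scores, epsilon=0.1):
    # Epsilon-greedy selection: mostly exploit the highest-scoring
    # strategy, occasionally explore a random one.
    if random.random() < epsilon:
        return random.choice(STRATEGIES)
    return max(scores, key=scores.get)

def defender_respond(refuse_rate):
    # The defender refuses the deceptive prompt with its current
    # refusal probability; otherwise it complies (attack succeeds).
    return "refuse" if random.random() < refuse_rate else "comply"

def co_train(rounds=5000, lr=0.01):
    scores = {s: 0.0 for s in STRATEGIES}  # attacker's strategy values
    refuse_rate = 0.5                      # defender's "policy" parameter
    for _ in range(rounds):
        strategy = attacker_pick(scores)
        response = defender_respond(refuse_rate)
        attack_won = (response == "comply")
        # Zero-sum signal: the attacker reinforces strategies that
        # slipped past the defender and penalizes those that did not.
        scores[strategy] += lr * (1.0 if attack_won else -1.0)
        # The defender refuses more after a breach, and slightly less
        # otherwise, modeling the pull toward preserving helpfulness.
        delta = 1.0 if attack_won else -0.1
        refuse_rate = min(1.0, max(0.0, refuse_rate + lr * delta))
    return scores, refuse_rate

if __name__ == "__main__":
    scores, refuse_rate = co_train()
    print("attacker strategy scores:", scores)
    print("defender refusal rate:", round(refuse_rate, 3))
\end{verbatim}

In the full framework, both sides would be LLM policies updated with policy-gradient RL over multi-turn interactions rather than the tabular updates above; the sketch only illustrates how simultaneous, opposed updates produce the co-evolutionary pressure the abstract describes.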