Ensuring robust safety alignment is crucial for Large Language Models (LLMs), yet existing defenses often lag behind evolving adversarial attacks because of their \textbf{reliance on static, pre-collected data distributions}. In this paper, we introduce \textbf{MAGIC}, a multi-turn, multi-agent reinforcement learning (RL) framework that formulates LLM safety alignment as an asymmetric adversarial game. Specifically, an attacker agent learns to iteratively rewrite original queries into deceptive prompts, while a defender agent simultaneously optimizes its policy to recognize and refuse such inputs. This dynamic process drives a \textbf{co-evolution} in which the attacker's ever-changing strategies continuously uncover long-tail vulnerabilities, pushing the defender to generalize to unseen attack patterns. Notably, we observe that the attacker, endowed with initial reasoning ability, evolves \textbf{novel, previously unseen combinatorial strategies} through iterative RL training, underscoring the potential of our method. Theoretically, we provide insights into a more robust game equilibrium and derive safety guarantees. Extensive experiments validate the effectiveness of our framework, demonstrating superior defense success rates without compromising the helpfulness of the model. Our code is available at https://github.com/BattleWen/MAGIC.