MARSHAL: Incentivizing Multi-Agent Reasoning via Self-Play with Strategic LLMs

Developing Large Language Models (LLMs) that cooperate and compete effectively within multi-agent systems (MASs) is a critical step towards more advanced intelligence. While reinforcement learning (RL) has proven effective for enhancing reasoning in single-agent tasks, its extension to multi-turn, multi-agent scenarios remains underexplored because of the challenges of long-horizon credit assignment and agent-specific advantage estimation. To address these challenges, we introduce MARSHAL, an end-to-end RL framework that incentivizes Multi-Agent Reasoning through Self-play witH strAtegic LLMs in both cooperative and competitive games. MARSHAL features a turn-level advantage estimator that aligns learning signals with each interaction for credit assignment, and an agent-specific advantage normalization that stabilizes multi-agent training. Through self-play across cooperative and competitive games, MARSHAL agents trained from Qwen3-4B develop strong strategic abilities, with performance improvements of up to 28.7% in held-out games. More importantly, the capability acquired through self-play generalizes beyond games, yielding consistent performance gains for MASs on reasoning benchmarks. When integrated into leading MASs, our MARSHAL agent achieves significant zero-shot performance gains of up to 10.0% on AIME, 7.6% on GPQA-Diamond, and 3.5% on average across all benchmarks. These results establish self-play in strategic games as a powerful approach for developing generalizable multi-agent reasoning capabilities in LLMs.
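
The two components named above can be illustrated with a minimal sketch. The abstract does not give the exact estimator, so everything here is an assumption for illustration: each turn is credited with a discounted return-to-go, and returns are then standardized separately within each agent's group rather than across the whole batch. The function names `turn_level_returns` and `agent_normalized_advantages`, the discount `gamma`, and the stabilizer `eps` are hypothetical and not taken from the paper.

```python
from collections import defaultdict
from statistics import mean, pstdev


def turn_level_returns(turn_rewards, gamma=1.0):
    """Discounted return-to-go for each turn of one agent's trajectory.

    Assumption: one scalar reward per turn; the paper's estimator may differ.
    """
    returns = []
    running = 0.0
    # Walk backwards so each turn accumulates the rewards that follow it.
    for reward in reversed(turn_rewards):
        running = reward + gamma * running
        returns.append(running)
    returns.reverse()
    return returns


def agent_normalized_advantages(turns, eps=1e-8):
    """Standardize turn returns within each agent's own group.

    turns: list of (agent_id, turn_return) pairs collected over a batch.
    Returns one advantage per input pair, in the same order.
    """
    by_agent = defaultdict(list)
    for agent_id, ret in turns:
        by_agent[agent_id].append(ret)
    # Per-agent mean and population standard deviation.
    stats = {a: (mean(v), pstdev(v)) for a, v in by_agent.items()}
    return [(ret - stats[a][0]) / (stats[a][1] + eps) for a, ret in turns]


if __name__ == "__main__":
    print(turn_level_returns([0.0, 0.0, 1.0], gamma=0.5))
    batch = [("p0", 1.0), ("p0", 3.0), ("p1", -1.0), ("p1", -3.0)]
    print(agent_normalized_advantages(batch))
```

In this sketch, normalizing per agent keeps one role's reward scale (for example, a player who usually wins) from dominating the advantage statistics of the other role, which is one plausible reading of how agent-specific normalization could stabilize training.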
