While multi-agent reinforcement learning (MARL) has proven effective across both collaborative and competitive tasks, existing algorithms often struggle to scale to large populations of agents. Recent advances in mean-field (MF) theory provide scalable solutions by approximating population interactions as a continuum, yet most existing frameworks focus exclusively on either fully cooperative or purely competitive settings. To bridge this gap, we introduce MF-MAPPO, a mean-field extension of PPO designed for zero-sum team games that integrate intra-team cooperation with inter-team competition. MF-MAPPO employs a shared actor and a minimally informed critic per team and is trained directly on finite-population simulators, enabling deployment in realistic scenarios with thousands of agents. We further show that MF-MAPPO extends naturally to partially observable settings through a simple gradient-regularized training scheme. We evaluate MF-MAPPO on large-scale benchmark scenarios in MFEnv, our own simulation platform for MF team games, including offense-defense battlefield tasks and variants of population-based rock-paper-scissors games that admit analytical solutions. Across these benchmarks, MF-MAPPO outperforms existing methods and exhibits complex, heterogeneous behaviors, demonstrating the effectiveness of combining mean-field theory with MARL techniques at scale.