Learning Nash equilibria (NE) in complex zero-sum games with multi-agent reinforcement learning (MARL) can be extremely computationally expensive. Curriculum learning is an effective way to accelerate learning, but one under-explored dimension for generating a curriculum is how difficult each subgame is to learn, where a subgame is the game induced by starting from a specific state. In this work, we present a novel subgame curriculum learning framework for zero-sum games. It adopts an adaptive initial state distribution by resetting agents to previously visited states from which they can quickly learn to improve performance. Building upon this framework, we derive a subgame selection metric that approximates the squared distance to NE values and further adopt a particle-based state sampler for subgame generation. Integrating these techniques leads to our new algorithm, Subgame Automatic Curriculum Learning (SACL), which is a realization of the subgame curriculum learning framework. SACL can be combined with any MARL algorithm, such as MAPPO. Experiments in the particle-world environment and the Google Research Football environment show that SACL produces much stronger policies than baselines. In the challenging hide-and-seek quadrant environment, SACL produces all four emergent stages while using only half the samples required by MAPPO with self-play. The project website is at https://sites.google.com/view/sacl-rl.
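To make the idea of an adaptive initial state distribution concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes a fixed-capacity buffer of visited states, sampling proportional to a per-state weight, and a mix with the environment's default reset. The names `SubgameBuffer`, `value_change_weight`, `reset_state`, and the mixing probability `p_buffer` are hypothetical, and the weight used here (squared change in a value estimate between two checkpoints) is only an assumed stand-in for the selection metric described above.

```python
import random


class SubgameBuffer:
    """Fixed-capacity store of visited states with sampling weights (hypothetical helper)."""

    def __init__(self, capacity, seed=0):
        self.capacity = capacity
        self.states = []
        self.weights = []
        self.rng = random.Random(seed)

    def add(self, state, weight):
        # Keep weights strictly positive so every stored state stays samplable.
        self.states.append(state)
        self.weights.append(max(float(weight), 1e-8))
        if len(self.states) > self.capacity:
            # Evict the lowest-weight state (an assumed eviction rule).
            i = min(range(len(self.weights)), key=self.weights.__getitem__)
            self.states.pop(i)
            self.weights.pop(i)

    def sample(self):
        # Draw one stored state with probability proportional to its weight.
        return self.rng.choices(self.states, weights=self.weights, k=1)[0]


def value_change_weight(v_new, v_old):
    """Assumed proxy metric: squared change in a state's value estimate."""
    return (v_new - v_old) ** 2


def reset_state(buffer, default_reset, p_buffer=0.7):
    """Start from a stored state with probability p_buffer, else from the default start."""
    if buffer.states and buffer.rng.random() < p_buffer:
        return buffer.sample()
    return default_reset()


if __name__ == "__main__":
    buf = SubgameBuffer(capacity=2)
    buf.add("state_a", value_change_weight(1.0, 0.9))
    buf.add("state_b", value_change_weight(1.0, 0.0))
    print(reset_state(buf, lambda: "default_start"))
```

Mixing in the default start distribution keeps the original game in the training data, so the agents are still trained on the full game rather than only on the selected subgames. The value 0.7 is an arbitrary placeholder.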