Multi-agent reinforcement learning (MARL) algorithms have achieved remarkable breakthroughs in solving large-scale decision-making tasks. Nonetheless, most existing MARL algorithms are model-free, which limits sample efficiency and hinders their applicability in more challenging scenarios. In contrast, model-based reinforcement learning (MBRL), particularly algorithms that integrate planning, such as MuZero, has demonstrated superhuman performance with limited data in many tasks. Hence, we aim to boost the sample efficiency of MARL by adopting model-based approaches. However, incorporating planning and search methods into multi-agent systems poses significant challenges: the expansive joint action space of multi-agent systems often necessitates exploiting the nearly independent behavior of agents to accelerate learning. To tackle this issue, we propose the MAZero algorithm, which combines a centralized model with Monte Carlo Tree Search (MCTS) for policy search. We design a novel network structure to facilitate distributed execution and parameter sharing. To enhance search efficiency in deterministic environments with sizable action spaces, we introduce two novel techniques: Optimistic Search Lambda (OS($\lambda$)) and Advantage-Weighted Policy Optimization (AWPO). Extensive experiments on the SMAC benchmark demonstrate that MAZero outperforms model-free approaches in terms of sample efficiency and provides comparable or better performance than existing model-based methods in terms of both sample and computational efficiency. Our code is available at https://github.com/liuqh16/MAZero.