In multi-agent reinforcement learning (MARL), independent learning (IL) often performs remarkably well and scales easily with the number of agents. Yet IL can be inefficient and risks failing to train, particularly in scenarios that require agents to coordinate their actions. Centralised learning (CL) enables MARL agents to quickly learn how to coordinate their behaviour, but employing CL everywhere is often prohibitively expensive in real-world applications. Moreover, using CL in value-based methods often requires strong representational constraints (e.g. the individual-global-max condition) that can lead to poor performance if violated. In this paper, we introduce a novel plug & play IL framework named Multi-Agent Network Selection Algorithm (MANSA), which selectively employs CL only at states that require coordination. At its core, MANSA has an additional agent that uses switching controls to quickly learn the best states at which to activate CL during training, using CL only where necessary and vastly reducing the computational burden of CL. Our theory proves that MANSA preserves cooperative MARL convergence properties, boosts IL performance and can make optimal use of a fixed budget on the number of CL calls. We show empirically in Level-based Foraging (LBF) and the StarCraft Multi-Agent Challenge (SMAC) that MANSA achieves faster, better and more reliable performance while making 40% fewer CL calls in SMAC and using only 1% of the CL calls in LBF.
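To make the idea of state-dependent switching under a call budget concrete, the following is a minimal sketch and not the MANSA algorithm itself. It assumes a tabular gate with a hypothetical name (`BudgetedSwitcher`), an assumed per-call cost for CL, and a simple epsilon-greedy rule; none of these details are taken from the paper.

```python
import random
from dataclasses import dataclass, field


@dataclass
class BudgetedSwitcher:
    """Hypothetical gate that decides, per state, whether to activate CL."""

    budget: int            # maximum number of CL calls allowed (assumed hard cap)
    epsilon: float = 0.1   # exploration rate of the gate (assumption)
    lr: float = 0.5        # step size for the gate's value estimates (assumption)
    cost: float = 0.1      # assumed fixed cost charged for each CL call
    q: dict = field(default_factory=dict)  # state -> [value of IL, value of CL]
    calls: int = 0         # number of CL calls made so far

    def use_cl(self, state, rng=random):
        """Return True if CL should be activated at this state."""
        if self.calls >= self.budget:
            return False  # budget exhausted: fall back to IL
        values = self.q.setdefault(state, [0.0, 0.0])
        if rng.random() < self.epsilon:
            choice = rng.random() < 0.5      # explore: pick IL or CL at random
        else:
            choice = values[1] > values[0]   # exploit: CL only if strictly better
        if choice:
            self.calls += 1
        return choice

    def update(self, state, used_cl, reward):
        """Move the chosen option's estimate toward the cost-adjusted reward."""
        values = self.q.setdefault(state, [0.0, 0.0])
        target = reward - (self.cost if used_cl else 0.0)
        idx = 1 if used_cl else 0
        values[idx] += self.lr * (target - values[idx])
```

In a training loop, one would query `use_cl(state)` before each update, route the update to the centralised or independent learner accordingly, and then call `update` with the observed team reward. The strict inequality in the exploit branch means the gate defaults to IL when it has no evidence that CL helps, which mirrors the goal of using CL only where necessary.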