In multi-agent reinforcement learning (MARL), independent learning (IL) often performs remarkably well and scales easily with the number of agents. Yet IL can be inefficient and risks failing to train, particularly in scenarios that require agents to coordinate their actions. Centralised learning (CL) enables MARL agents to learn quickly how to coordinate their behaviour, but employing CL everywhere is often prohibitively expensive in real-world applications. Moreover, using CL in value-based methods often requires strong representational constraints (e.g. the individual-global-max condition) that can lead to poor performance if violated. In this paper, we introduce a novel plug-and-play IL framework named Multi-Agent Network Selection Algorithm (MANSA), which selectively employs CL only at states that require coordination. At its core, MANSA has an additional agent that uses switching controls to quickly learn the best states at which to activate CL during training, using CL only where necessary and vastly reducing the computational burden of CL. Our theory proves that MANSA preserves cooperative MARL convergence properties, boosts IL performance, and can make optimal use of a fixed budget on the number of CL calls. We show empirically in Level-based Foraging (LBF) and the StarCraft Multi-Agent Challenge (SMAC) that MANSA achieves fast, superior and more reliable performance while making 40% fewer CL calls in SMAC and only 1% of the CL calls in LBF.
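To make the switching idea concrete, the following is a minimal toy sketch and not the MANSA implementation. It assumes a tabular switching agent that, at each state, chooses between keeping IL and activating CL, pays a fixed cost per CL call, and stops calling CL once a fixed budget is spent. The names `Switcher`, `toy_team_reward` and `run_demo`, the two toy states, the reward values and the one-step update rule are all hypothetical choices made for illustration.

```python
import random
from collections import defaultdict

IL, CL = 0, 1  # switcher actions: stay with independent learning, or activate centralised learning


class Switcher:
    """Hypothetical tabular switching agent (illustrative only, not the paper's method)."""

    def __init__(self, cl_cost=0.1, budget=5000, lr=0.5, epsilon=0.1, seed=0):
        self.q = defaultdict(lambda: [0.0, 0.0])  # state -> [value of IL, value of CL]
        self.cl_cost = cl_cost    # assumed fixed cost charged for every CL call
        self.budget = budget      # assumed cap on the total number of CL calls
        self.lr = lr
        self.epsilon = epsilon
        self.rng = random.Random(seed)

    def choose(self, state):
        """Return IL or CL for this state; CL is unavailable once the budget is spent."""
        if self.budget <= 0:
            return IL
        if self.rng.random() < self.epsilon:
            choice = self.rng.choice((IL, CL))  # occasional exploration
        else:
            values = self.q[state]
            choice = CL if values[CL] > values[IL] else IL  # ties default to IL
        if choice == CL:
            self.budget -= 1
        return choice

    def update(self, state, choice, team_reward):
        """One-step update: the switcher's payoff is the team reward minus any CL cost."""
        target = team_reward - (self.cl_cost if choice == CL else 0.0)
        self.q[state][choice] += self.lr * (target - self.q[state][choice])


def toy_team_reward(state, choice):
    """Toy assumption: 'solo' states succeed either way; 'joint' states need CL."""
    return 1.0 if state == "solo" or choice == CL else 0.0


def run_demo(steps=4000):
    switcher = Switcher()
    for t in range(steps):
        state = "joint" if t % 2 else "solo"  # alternate between the two toy states
        choice = switcher.choose(state)
        switcher.update(state, choice, toy_team_reward(state, choice))
    return switcher


if __name__ == "__main__":
    sw = run_demo()
    for state in ("solo", "joint"):
        print(state, "prefers", "CL" if sw.q[state][CL] > sw.q[state][IL] else "IL")
```

In this toy setting the switcher learns to activate CL only in the state where coordination changes the outcome, because the per-call cost makes CL unattractive wherever IL already succeeds. The actual framework learns this decision jointly with the MARL agents during training, which the sketch does not attempt to reproduce.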