The need for cooperation among intelligent machines has popularised cooperative multi-agent reinforcement learning (MARL) in AI research. However, many approaches rely heavily on parameter sharing among agents, which confines them to the homogeneous-agent setting and leads to training instability and a lack of convergence guarantees. To achieve effective cooperation in the general heterogeneous-agent setting, we propose Heterogeneous-Agent Reinforcement Learning (HARL) algorithms that resolve these issues. Central to our findings are the multi-agent advantage decomposition lemma and the sequential update scheme. Based on these, we develop the provably correct Heterogeneous-Agent Trust Region Learning (HATRL), and derive HATRPO and HAPPO from it by tractable approximations. Furthermore, we discover a novel framework named Heterogeneous-Agent Mirror Learning (HAML), which strengthens the theoretical guarantees for HATRPO and HAPPO and provides a general template for cooperative MARL algorithm design. We prove that all algorithms derived from HAML inherently enjoy monotonic improvement of the joint return and convergence to a Nash equilibrium. As a natural consequence, HAML validates further novel algorithms beyond HATRPO and HAPPO, including HAA2C, HADDPG, and HATD3, which generally outperform their existing MA-counterparts. We comprehensively test HARL algorithms on six challenging benchmarks and demonstrate their superior effectiveness and stability in coordinating heterogeneous agents compared to strong baselines such as MAPPO and QMIX.
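The multi-agent advantage decomposition lemma is only named in the abstract. As a rough illustration, assuming the usual statement that the joint advantage equals the sum of each agent's advantage conditioned on the actions of the agents ordered before it, the sketch below checks that identity numerically on a hypothetical one-step, two-agent cooperative game. The reward table, the policies, and all function names are made up for illustration and are not taken from the HARL implementation.

```python
from itertools import product

# Hypothetical one-step cooperative game with a shared reward (assumption).
# REWARD[a1][a2] plays the role of the joint action value Q(a1, a2).
REWARD = [[1.0, 0.0, 2.0],
          [0.5, 3.0, -1.0]]
PI_1 = [0.4, 0.6]        # agent 1's policy over its 2 actions
PI_2 = [0.2, 0.5, 0.3]   # agent 2's policy over its 3 actions


def joint_value():
    """V: expected shared reward when both agents follow their policies."""
    return sum(PI_1[a1] * PI_2[a2] * REWARD[a1][a2]
               for a1, a2 in product(range(2), range(3)))


def q_agent1(a1):
    """Q^1(a1): agent 1's action is fixed, agent 2 is averaged out."""
    return sum(PI_2[a2] * REWARD[a1][a2] for a2 in range(3))


def advantage_agent1(a1):
    """A^1(a1) = Q^1(a1) - V."""
    return q_agent1(a1) - joint_value()


def advantage_agent2_given_1(a1, a2):
    """A^2(a1, a2) = Q(a1, a2) - Q^1(a1): agent 2's gain given agent 1's action."""
    return REWARD[a1][a2] - q_agent1(a1)


def joint_advantage(a1, a2):
    """A(a1, a2) = Q(a1, a2) - V."""
    return REWARD[a1][a2] - joint_value()


def max_decomposition_gap():
    """Largest absolute difference between the joint advantage and the sum
    of the sequentially conditioned per-agent advantages."""
    return max(
        abs(joint_advantage(a1, a2)
            - (advantage_agent1(a1) + advantage_agent2_given_1(a1, a2)))
        for a1, a2 in product(range(2), range(3))
    )


if __name__ == "__main__":
    print("max decomposition gap:", max_decomposition_gap())
```

The gap is zero up to floating-point rounding because the sum telescopes. No assumption about the agents sharing parameters or action spaces is needed, which is why such a decomposition suits the heterogeneous-agent setting; here the two agents even have different numbers of actions.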