Heterogeneous-Agent Reinforcement Learning

The need for cooperation among intelligent machines has popularised cooperative multi-agent reinforcement learning (MARL) in AI research. However, many research endeavours rely heavily on parameter sharing among agents, which confines them to the homogeneous-agent setting and leads to training instability and a lack of convergence guarantees. To achieve effective cooperation in the general heterogeneous-agent setting, we propose Heterogeneous-Agent Reinforcement Learning (HARL) algorithms that resolve these issues. Central to our findings are the multi-agent advantage decomposition lemma and the sequential update scheme. Based on these, we develop the provably correct Heterogeneous-Agent Trust Region Learning (HATRL), which is free of the parameter-sharing constraint, and derive HATRPO and HAPPO from it by tractable approximations. Furthermore, we discover a novel framework named Heterogeneous-Agent Mirror Learning (HAML), which strengthens the theoretical guarantees for HATRPO and HAPPO and provides a general template for cooperative MARL algorithm design. We prove that all algorithms derived from HAML inherently enjoy monotonic improvement of the joint reward and convergence to a Nash equilibrium. As a natural outcome, HAML validates further novel algorithms beyond HATRPO and HAPPO, including HAA2C, HADDPG, and HATD3, which consistently outperform their existing MA-counterparts. We comprehensively test HARL algorithms on six challenging benchmarks and demonstrate their superior effectiveness and stability in coordinating heterogeneous agents compared to strong baselines such as MAPPO and QMIX.
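The abstract names the multi-agent advantage decomposition lemma without stating it. As the lemma is usually formulated, the joint advantage of an ordered subset of agents i_1, ..., i_m equals the sum of per-agent advantages, where each agent's advantage is conditioned on the actions already chosen by the agents before it in the order: A^{i_{1:m}}(s, a^{i_{1:m}}) = sum_{j=1}^{m} A^{i_j}(s, a^{i_{1:j-1}}, a^{i_j}). The following is a minimal numerical sketch of that identity for a single state, assuming a tabular joint Q-function and independent per-agent policies. The names `marginal_q` and `advantage_decomposition`, the random Q-table, and the action-space sizes are all hypothetical and chosen only for illustration; they are not taken from the authors' implementation.

```python
import numpy as np


def marginal_q(q, policies, fixed):
    """Expected joint Q at one state when the agents in `fixed`
    (dict: agent index -> action) act as given and every other agent
    samples from its own policy. Illustrative helper, not from the paper."""
    value = q
    # Reduce axes from last to first so remaining axis indices stay valid.
    for agent in reversed(range(q.ndim)):
        if agent in fixed:
            value = np.take(value, fixed[agent], axis=agent)
        else:
            value = np.tensordot(value, policies[agent], axes=([agent], [0]))
    return float(value)


def advantage_decomposition(q, policies, order, actions):
    """Return (joint advantage of the agents in `order`,
    sum of their sequentially conditioned advantages)."""
    baseline = marginal_q(q, policies, {})  # state value V(s)
    joint_adv = marginal_q(q, policies, {i: actions[i] for i in order}) - baseline

    total, fixed = 0.0, {}
    for i in order:
        before = marginal_q(q, policies, fixed)
        fixed[i] = actions[i]  # agent i commits to its action
        total += marginal_q(q, policies, fixed) - before
    return joint_adv, total


# Three agents with different action-space sizes (2, 3 and 4 actions).
rng = np.random.default_rng(0)
q_table = rng.normal(size=(2, 3, 4))
agent_policies = [rng.dirichlet(np.ones(k)) for k in q_table.shape]

lhs, rhs = advantage_decomposition(
    q_table, agent_policies, order=[2, 0, 1], actions={0: 1, 1: 2, 2: 3}
)
print(lhs, rhs)
assert np.isclose(lhs, rhs)
```

The identity holds for any ordering of the agents because the sum telescopes, and it requires neither identical action spaces nor shared policy parameters. This is what allows agents to be improved one at a time, each conditioned on the choices of those before it, which is the idea behind the sequential update scheme.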
