Multi-agent reinforcement learning (MARL) has proven effective for cooperative games in recent years. However, existing state-of-the-art methods suffer from sample inefficiency, brittleness with respect to hyperparameters, and the risk of converging to a suboptimal Nash equilibrium. To resolve these issues, we propose a novel theoretical framework, Maximum Entropy Heterogeneous-Agent Mirror Learning (MEHAML), that leverages the maximum entropy principle to design maximum entropy MARL actor-critic algorithms. We prove that algorithms derived from the MEHAML framework enjoy two desirable properties: monotonic improvement of the joint maximum entropy objective and convergence to a quantal response equilibrium (QRE). We demonstrate the practicality of MEHAML by developing HASAC, a MEHAML extension of the widely used RL algorithm soft actor-critic (SAC), which shows significant improvements in exploration and robustness on three challenging benchmarks: Multi-Agent MuJoCo, StarCraftII, and Google Research Football. Our results show that HASAC outperforms strong baseline methods such as HATD3, HAPPO, QMIX, and MAPPO, thereby establishing a new state of the art. See our project page at https://sites.google.com/view/mehaml.