Multi-agent reinforcement learning (MARL) has proven effective for cooperative games in recent years. However, existing state-of-the-art methods face challenges of high sample complexity, training instability, and the risk of converging to a suboptimal Nash equilibrium. In this paper, we propose a unified framework for learning stochastic policies to resolve these issues. We embed cooperative MARL problems into probabilistic graphical models, from which we derive the maximum entropy (MaxEnt) objective for MARL. Based on the MaxEnt framework, we propose the Heterogeneous-Agent Soft Actor-Critic (HASAC) algorithm. Theoretically, we prove that HASAC enjoys monotonic improvement and convergence to a quantal response equilibrium (QRE). Furthermore, we generalize a unified template for MaxEnt algorithmic design, named Maximum Entropy Heterogeneous-Agent Mirror Learning (MEHAML), which provides any induced method with the same guarantees as HASAC. We evaluate HASAC on six benchmarks: Bi-DexHands, Multi-Agent MuJoCo, StarCraft Multi-Agent Challenge, Google Research Football, Multi-Agent Particle Environment, and Light Aircraft Game. Results show that HASAC consistently outperforms strong baselines, exhibiting better sample efficiency, robustness, and sufficient exploration. See our project page at https://sites.google.com/view/meharl.
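For orientation, a minimal sketch of the kind of joint MaxEnt objective such a framework optimizes, written in standard MARL notation; the temperature $\alpha$, discount $\gamma$, and per-agent entropy decomposition here are our assumed notation for illustration, not necessarily the paper's exact formulation:
\[
J(\boldsymbol{\pi}) \;=\; \mathbb{E}_{\boldsymbol{\pi}}\!\left[\sum_{t=0}^{\infty} \gamma^{t}\Big( r(s_t, \boldsymbol{a}_t) \;+\; \alpha \sum_{i=1}^{n} \mathcal{H}\big(\pi^{i}(\cdot \mid s_t)\big) \Big)\right],
\]
where $\boldsymbol{\pi} = (\pi^{1},\dots,\pi^{n})$ is the joint policy of $n$ agents, $r$ is the shared team reward, and the entropy bonus encourages each agent's stochastic exploration.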