Multi-agent reinforcement learning (MARL) has proven effective for cooperative games in recent years. However, existing state-of-the-art methods still suffer from high sample complexity, training instability, and the risk of converging to a suboptimal Nash equilibrium. In this paper, we propose a unified framework for learning stochastic policies to resolve these issues. We embed cooperative MARL problems into probabilistic graphical models, from which we derive the maximum entropy (MaxEnt) objective for MARL. Based on the MaxEnt framework, we propose the Heterogeneous-Agent Soft Actor-Critic (HASAC) algorithm. Theoretically, we prove that HASAC enjoys monotonic improvement and converges to a quantal response equilibrium (QRE). Furthermore, we generalize HASAC into a unified template for MaxEnt algorithm design, Maximum Entropy Heterogeneous-Agent Mirror Learning (MEHAML), which provides any induced method with the same guarantees as HASAC. We evaluate HASAC on six benchmarks: Bi-DexHands, Multi-Agent MuJoCo, StarCraft Multi-Agent Challenge, Google Research Football, Multi-Agent Particle Environment, and Light Aircraft Game. Results show that HASAC consistently outperforms strong baselines, exhibiting better sample efficiency, robustness, and sufficient exploration. See our project page at \url{https://sites.google.com/view/meharl}.
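For intuition on how a MaxEnt objective relates to monotonic improvement and QRE, the following toy sketch may help. It is not HASAC itself: it assumes a one-shot, two-agent cooperative matrix game rather than a Markov game, uses exact tabular updates instead of learned actor and critic networks, and all names (`softmax`, `entropy`, `maxent_objective`, `sequential_soft_updates`, the payoff matrix `R`, and the temperature `alpha`) are hypothetical choices made for illustration. Each agent in turn replaces its policy with the softmax of its expected shared payoff, which is the exact maximizer of the entropy-regularized objective with the other agent held fixed; a fixed point of these mutual softmax responses is a logit QRE.

```python
import math


def softmax(xs, alpha):
    """Numerically stable softmax of xs / alpha."""
    m = max(xs)
    ws = [math.exp((x - m) / alpha) for x in xs]
    z = sum(ws)
    return [w / z for w in ws]


def entropy(p):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(x * math.log(x) for x in p if x > 0.0)


def maxent_objective(R, p1, p2, alpha):
    """Expected shared reward plus alpha times the sum of both policy entropies."""
    expected = sum(
        p1[i] * R[i][j] * p2[j]
        for i in range(len(p1))
        for j in range(len(p2))
    )
    return expected + alpha * (entropy(p1) + entropy(p2))


def sequential_soft_updates(R, alpha, iters=50):
    """Alternate exact soft best responses, starting from uniform policies.

    Returns the final policies and the objective value after each round.
    """
    n1, n2 = len(R), len(R[0])
    p1 = [1.0 / n1] * n1
    p2 = [1.0 / n2] * n2
    history = [maxent_objective(R, p1, p2, alpha)]
    for _ in range(iters):
        # Agent 1 responds to agent 2's current policy.
        q1 = [sum(R[i][j] * p2[j] for j in range(n2)) for i in range(n1)]
        p1 = softmax(q1, alpha)
        # Agent 2 responds to agent 1's freshly updated policy.
        q2 = [sum(p1[i] * R[i][j] for i in range(n1)) for j in range(n2)]
        p2 = softmax(q2, alpha)
        history.append(maxent_objective(R, p1, p2, alpha))
    return p1, p2, history


if __name__ == "__main__":
    # Illustrative payoff matrix with two coordination outcomes of unequal value.
    R = [[10.0, 0.0], [0.0, 5.0]]
    p1, p2, history = sequential_soft_updates(R, alpha=1.0)
    print("agent 1 policy:", p1)
    print("agent 2 policy:", p2)
    print("objective is non-decreasing:",
          all(b >= a - 1e-9 for a, b in zip(history, history[1:])))
```

Because every update is an exact coordinate-wise maximizer, the objective cannot decrease from one round to the next, which loosely mirrors the monotonic improvement property claimed for HASAC. Larger values of `alpha` keep both policies closer to uniform, while smaller values make them nearly deterministic.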