Multi-agent reinforcement learning (MARL) has proven effective for cooperative games in recent years. However, existing state-of-the-art methods suffer from high sample complexity, training instability, and the risk of converging to a suboptimal Nash equilibrium. In this paper, we propose a unified framework for learning \emph{stochastic} policies to resolve these issues. We embed cooperative MARL problems into probabilistic graphical models, from which we derive the maximum entropy (MaxEnt) objective for MARL. Based on the MaxEnt framework, we propose the Heterogeneous-Agent Soft Actor-Critic (HASAC) algorithm. Theoretically, we prove that HASAC improves monotonically and converges to a quantal response equilibrium (QRE). Furthermore, we generalize HASAC into a unified template for MaxEnt algorithmic design, named Maximum Entropy Heterogeneous-Agent Mirror Learning (MEHAML), which provides any induced method with the same guarantees as HASAC. We evaluate HASAC on six benchmarks: Bi-DexHands, Multi-Agent MuJoCo, StarCraft Multi-Agent Challenge, Google Research Football, Multi-Agent Particle Environment, and Light Aircraft Game. Results show that HASAC consistently outperforms strong baselines, exhibiting better sample efficiency, robustness, and sufficient exploration.
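For orientation, a MaxEnt objective for cooperative MARL is commonly written in the following form. This is a sketch under assumed notation that the abstract itself does not fix: $n$ agents with a factorized joint policy $\boldsymbol{\pi} = (\pi^1, \dots, \pi^n)$, a shared reward $r$, a discount factor $\gamma$, and a temperature $\alpha > 0$ that weights the entropy bonus.
\begin{equation*}
J(\boldsymbol{\pi}) = \mathbb{E}_{s_{0:\infty},\, \mathbf{a}_{0:\infty} \sim \boldsymbol{\pi}} \left[ \sum_{t=0}^{\infty} \gamma^{t} \left( r(s_t, \mathbf{a}_t) + \alpha \sum_{i=1}^{n} \mathcal{H}\!\left( \pi^{i}(\cdot \mid s_t) \right) \right) \right]
\end{equation*}
Under these assumptions, letting $\alpha \to 0$ recovers the standard expected-return objective, while larger $\alpha$ rewards more stochastic per-agent policies.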