In multi-agent reinforcement learning, optimal control with robustness guarantees is critical for real-world deployment. However, existing methods face challenges of high sample complexity, training instability, potential convergence to suboptimal Nash equilibria, and non-robustness to multiple perturbations. In this paper, we propose a unified framework for learning \emph{stochastic} policies that resolves these issues. We embed cooperative MARL problems into probabilistic graphical models, from which we derive the maximum entropy (MaxEnt) objective optimal for MARL. Based on the MaxEnt framework, we propose the \emph{Heterogeneous-Agent Soft Actor-Critic} (HASAC) algorithm. Theoretically, we prove that HASAC enjoys monotonic improvement and convergence to a \emph{quantal response equilibrium} (QRE). Furthermore, HASAC is provably robust against a wide range of real-world uncertainties, including perturbations in rewards, environment dynamics, states, and actions. Finally, we generalize HASAC to a unified template for MaxEnt algorithmic design, named \emph{Maximum Entropy Heterogeneous-Agent Mirror Learning} (MEHAML), which endows any induced method with the same guarantees as HASAC. We evaluate HASAC on seven benchmarks: Bi-DexHands, Multi-Agent MuJoCo, Pursuit-Evade, StarCraft Multi-Agent Challenge, Google Research Football, Multi-Agent Particle Environment, and Light Aircraft Game. Results show that HASAC consistently outperforms strong baselines in 34 out of 38 tasks, exhibiting improved training stability, better sample efficiency, and sufficient exploration. The robustness of HASAC is further validated under uncertainties in rewards, dynamics, states, and actions across 14 perturbation magnitudes, as well as through real-world deployment in a multi-robot arena subject to these four types of uncertainties. See our page at \url{https://sites.google.com/view/meharl}.