This paper extends the Mirror Descent method to cooperative Multi-Agent Reinforcement Learning (MARL) settings in which agents have varying abilities and individual policies. The proposed Heterogeneous-Agent Mirror Descent Policy Optimization (HAMDPO) algorithm uses the multi-agent advantage decomposition lemma to update each agent's policy efficiently while ensuring that overall performance improves. HAMDPO updates agent policies iteratively through an approximate solution of the trust-region problem, which guarantees stable updates and improved performance. The algorithm handles both continuous and discrete action spaces for heterogeneous agents across a range of MARL problems. We evaluate HAMDPO on Multi-Agent MuJoCo and StarCraft II tasks, where it outperforms state-of-the-art algorithms such as HATRPO and HAPPO. These results suggest that HAMDPO is a promising approach to cooperative MARL and could be extended to other challenging MARL problems.
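To make the mirror descent idea concrete, the sketch below shows the textbook KL-regularized mirror descent step for a single agent, a single state, and a discrete action set. It is an illustration under stated assumptions, not the HAMDPO algorithm itself: the function name `mirror_descent_step`, the `step_size` parameter, and the tabular setting are all hypothetical choices made for this example, whereas HAMDPO solves the trust-region problem only approximately and applies it per agent.

```python
import math


def mirror_descent_step(pi_old, advantages, step_size):
    """One KL-regularized mirror descent step for one state (tabular sketch).

    Maximizes  sum_a pi(a) * A(a) - (1 / step_size) * KL(pi || pi_old)
    over the probability simplex. The closed-form maximizer is
    pi_new(a) proportional to pi_old(a) * exp(step_size * A(a)).
    All names here are illustrative assumptions, not taken from the paper.
    """
    # Work in log space; actions with zero prior probability stay at zero.
    logits = [
        (math.log(p) if p > 0 else -math.inf) + step_size * a
        for p, a in zip(pi_old, advantages)
    ]
    # Subtract the maximum logit for numerical stability before exponentiating.
    top = max(logits)
    weights = [math.exp(l - top) for l in logits]
    total = sum(weights)
    return [w / total for w in weights]


# Usage: two actions, uniform old policy, action 0 has the higher advantage.
pi_old = [0.5, 0.5]
advantages = [1.0, -1.0]
pi_new = mirror_descent_step(pi_old, advantages, step_size=0.5)
```

A larger `step_size` weakens the KL penalty and moves the policy further toward high-advantage actions, while `step_size = 0` leaves the policy unchanged. This is the sense in which the KL term plays the role of a soft trust region.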