Traditional centralized multi-agent reinforcement learning (MARL) algorithms are sometimes impractical in complicated applications, due to non-interactivity between agents, the curse of dimensionality, and computational complexity. This has motivated several decentralized MARL algorithms. However, existing decentralized methods handle only the fully cooperative setting and require a large amount of information to be transmitted during training. The block coordinate gradient descent scheme they use, which performs successive independent actor and critic steps, simplifies the computation but introduces serious bias. In this paper, we propose a flexible, fully decentralized actor-critic MARL framework that can be combined with most actor-critic methods and handles large-scale general cooperative multi-agent settings. To achieve decentralization, we design a primal-dual hybrid gradient descent type algorithmic framework that trains each agent separately. From the perspective of each agent, policy improvement and value evaluation are jointly optimized, which stabilizes multi-agent policy learning. Furthermore, through a parameter-sharing mechanism and a novel method for modeling other agents, based on theory of mind and online supervised learning, our framework achieves scalability and stability in large-scale environments and reduces information transmission. Extensive experiments in the cooperative Multi-agent Particle Environment and StarCraft II show that our decentralized MARL instantiations perform competitively against conventional centralized and decentralized methods.