Multi-Agent Policy Gradient (MAPG) has made significant progress in recent years. However, centralized critics in state-of-the-art MAPG methods still face the centralized-decentralized mismatch (CDM) issue, in which sub-optimal actions by some agents affect other agents' policy learning. While using individual critics for policy updates avoids this issue, it severely limits cooperation among agents. To address this trade-off, we propose an agent topology framework, which decides whether other agents should be considered in an agent's policy gradient and achieves a compromise between facilitating cooperation and alleviating the CDM issue. The agent topology allows agents to use coalition utility as the learning objective, instead of the global utility given by centralized critics or the local utility given by individual critics. We study various models for constituting the agent topology. We propose Topology-based multi-Agent Policy gradiEnt (TAPE) for both stochastic and deterministic MAPG methods. We prove the policy improvement theorem for stochastic TAPE and give a theoretical explanation for the improved cooperation among agents. Experimental results on several benchmarks show that the agent topology both facilitates agent cooperation and alleviates the CDM issue, improving the performance of TAPE. Finally, multiple ablation studies and a heuristic graph search algorithm are devised to show the efficacy of the agent topology.
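To make the idea of coalition utility concrete, the following is a minimal sketch rather than the paper's implementation. It assumes, purely for illustration, that the agent topology is a binary adjacency matrix and that an agent's coalition utility is the sum of the per-agent utilities of the agents it is connected to. The names `coalition_utility` and `sample_topology`, the example utilities, and the independent random edge model are all hypothetical.

```python
import numpy as np


def coalition_utility(adjacency, utilities):
    """Return, for each agent i, the sum of utilities of agents j with adjacency[i, j] = 1.

    Assumption: coalition utility is a plain sum over connected agents.
    """
    adjacency = np.asarray(adjacency, dtype=float)
    utilities = np.asarray(utilities, dtype=float)
    return adjacency @ utilities


def sample_topology(n_agents, p, rng):
    """Hypothetical topology model: include each directed edge independently with probability p.

    Self-loops are always kept so every agent considers its own utility.
    """
    edges = rng.random((n_agents, n_agents)) < p
    np.fill_diagonal(edges, True)
    return edges.astype(float)


# Made-up per-agent utilities for three agents; agent 1 acts sub-optimally.
utilities = np.array([1.0, -2.0, 0.5])

# Identity topology: each agent sees only its own utility (individual critics).
local_u = coalition_utility(np.eye(3), utilities)

# Fully connected topology: every agent sees the sum of all utilities (centralized critic).
global_u = coalition_utility(np.ones((3, 3)), utilities)

# An intermediate topology sampled at random sits between the two extremes.
rng = np.random.default_rng(0)
topology = sample_topology(3, 0.5, rng)
mixed_u = coalition_utility(topology, utilities)
```

Under these assumptions, the identity matrix recovers the local-utility case and the all-ones matrix recovers the global-utility case, where the poor action of agent 1 lowers every agent's learning signal. A sparser topology limits how far that effect spreads while still letting connected agents share an objective.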