Decentralized execution is a core requirement in cooperative multi-agent reinforcement learning (MARL). Most popular MARL algorithms now adopt decentralized policies to enable decentralized execution and use gradient descent as their optimizer. However, theoretical analyses of these algorithms rarely take the optimization method into account, and we find that various popular MARL algorithms with decentralized policies are suboptimal on toy tasks when gradient descent is chosen as the optimization method. In this paper, we theoretically analyze two common classes of algorithms with decentralized policies -- multi-agent policy gradient methods and value-decomposition methods -- and prove their suboptimality when gradient descent is used. In addition, we propose the Transformation And Distillation (TAD) framework, which reformulates a multi-agent MDP as a special single-agent MDP with a sequential structure and enables decentralized execution by distilling the policy learned on the derived ``single-agent'' MDP. This two-stage learning paradigm addresses the optimization problem in cooperative MARL while preserving its performance guarantee. Empirically, we implement TAD-PPO based on PPO, which can theoretically learn an optimal policy in finite multi-agent MDPs and achieves significantly better performance on a large set of cooperative multi-agent tasks.
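
To make the transformation step concrete, the following is a minimal sketch and not the authors' implementation. It assumes a one-step team game rather than a full MDP, and the payoff table and the names `sequential_optimum`, `payoff`, and `reward` are hypothetical, introduced only for illustration. Agents choose actions one at a time, the ``state'' is the tuple of actions already chosen, and backward induction over this sequential single-agent view recovers the optimal joint action. The distillation stage is not shown.

```python
from itertools import product


def sequential_optimum(reward, n_agents, n_actions):
    """Backward induction on a sequential single-agent view of a one-step team game.

    A 'state' is the tuple of actions already chosen by agents 0..i-1.
    Agent i then picks its action; the shared reward is received only
    after all n_agents have acted.
    Returns (optimal value, optimal joint action).
    """
    def value(prefix):
        if len(prefix) == n_agents:
            # Terminal state: every agent has acted, read off the team reward.
            return reward[prefix], prefix
        # Non-terminal state: maximise over the next agent's action only.
        return max(
            (value(prefix + (a,)) for a in range(n_actions)),
            key=lambda pair: pair[0],
        )

    return value(())


# Hypothetical 2-agent, 3-action payoff table (illustrative values only).
payoff = [
    [8, -12, -12],
    [-12, 0, 0],
    [-12, 0, 0],
]
reward = {(a, b): payoff[a][b] for a, b in product(range(3), repeat=2)}

best_value, best_joint = sequential_optimum(reward, n_agents=2, n_actions=3)
print(best_value, best_joint)
```

In this sketch each decision ranges over a single agent's action set, so the per-step action space stays at `n_actions` instead of growing to `n_actions ** n_agents`, at the cost of a longer decision sequence.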