While Centralized Training with Decentralized Execution (CTDE) has become the prevailing paradigm in Multi-Agent Reinforcement Learning (MARL), it may not be suitable for scenarios in which agents can fully communicate and share observations with each other. Fully centralized methods, also known as Centralized Training with Centralized Execution (CTCE) methods, can fully utilize the observations of all agents by treating the entire system as a single agent. However, traditional CTCE methods suffer from scalability issues because the joint action space grows exponentially with the number of agents. To address this challenge, we propose JointPPO, a CTCE method that uses Proximal Policy Optimization (PPO) to directly optimize the joint policy of the multi-agent system. JointPPO decomposes the joint policy into conditional probabilities, transforming the decision-making process into a sequence generation task. A Transformer-based joint policy network is constructed and trained with a PPO loss tailored for the joint policy. JointPPO effectively handles a large joint action space and extends PPO to the multi-agent setting with theoretical clarity and conciseness. Extensive experiments on the StarCraft Multi-Agent Challenge (SMAC) testbed show that JointPPO outperforms strong baselines. Ablation experiments and analyses are conducted to explore the factors influencing JointPPO's performance.
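The decomposition of the joint policy into conditional probabilities can be illustrated with a minimal sketch. It assumes the usual autoregressive factorization, in which each agent's action is drawn conditioned on the state and on the actions already chosen for earlier agents, so that the joint log-probability is the sum of the per-agent conditional log-probabilities. The function `toy_conditional_logits` is a hypothetical stand-in for the Transformer-based policy network, and `ppo_clipped_objective` is the standard PPO clipped surrogate applied to the joint probability ratio; the tailored loss used by JointPPO may differ in its details.

```python
import itertools
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def toy_conditional_logits(state, prev_actions, n_actions):
    """Hypothetical stand-in for a learned network (not the paper's model).

    Returns logits for the next agent's action given the state and the
    actions already selected for earlier agents.
    """
    context = sum((j + 1) * a for j, a in enumerate(prev_actions))
    return [math.sin(state + 1.7 * k + 0.9 * context) for k in range(n_actions)]


def sample_joint_action(state, n_agents, n_actions, rng):
    """Sample one action per agent in sequence; return (actions, joint log-prob)."""
    actions, log_prob = [], 0.0
    for _ in range(n_agents):
        probs = softmax(toy_conditional_logits(state, actions, n_actions))
        a = rng.choices(range(n_actions), weights=probs)[0]
        log_prob += math.log(probs[a])
        actions.append(a)
    return actions, log_prob


def joint_log_prob(state, actions, n_actions):
    """Joint log-probability as the sum of conditional log-probabilities."""
    log_prob = 0.0
    for i, a in enumerate(actions):
        probs = softmax(toy_conditional_logits(state, actions[:i], n_actions))
        log_prob += math.log(probs[a])
    return log_prob


def total_joint_probability(state, n_agents, n_actions):
    """Sum the joint probability over every joint action (should be 1)."""
    return sum(
        math.exp(joint_log_prob(state, list(joint), n_actions))
        for joint in itertools.product(range(n_actions), repeat=n_agents)
    )


def ppo_clipped_objective(new_log_prob, old_log_prob, advantage, clip_eps=0.2):
    """Standard PPO clipped surrogate, evaluated on the joint probability ratio."""
    ratio = math.exp(new_log_prob - old_log_prob)
    clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    return min(ratio * advantage, clipped * advantage)


if __name__ == "__main__":
    rng = random.Random(0)
    actions, logp = sample_joint_action(0.3, n_agents=3, n_actions=4, rng=rng)
    print(actions, logp)
    print(total_joint_probability(0.3, n_agents=3, n_actions=4))
```

In this sketch, sampling requires `n_agents` distributions of size `n_actions` rather than a single distribution over `n_actions ** n_agents` joint actions, which is the sense in which a sequential decomposition sidesteps the exponential joint action space.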