When extending the theoretical guarantee of PPO to cooperative multi-agent reinforcement learning (MARL), existing multi-agent PPO algorithms lack compatibility with different types of parameter sharing. In this paper, we propose a novel and versatile multi-agent PPO algorithm for cooperative MARL that overcomes this limitation. Our approach builds on the proposed full-pipeline paradigm, which establishes multiple parallel optimization pipelines by employing various equivalent decompositions of the advantage function. This paradigm formulates the interconnections among agents in a more general manner, i.e., as interconnections among pipelines, making it compatible with diverse types of parameter sharing. We provide a solid theoretical foundation for policy improvement and then derive a practical algorithm, Full-Pipeline PPO (FP3O), through several approximations. Empirical evaluations on Multi-Agent MuJoCo and StarCraftII tasks demonstrate that FP3O outperforms other strong baselines and exhibits remarkable versatility across various parameter-sharing configurations.