Most multi-agent reinforcement learning (MARL) approaches adopt one of two policy optimization schemes, updating agents' policies either simultaneously or sequentially. Updating the policies of all agents simultaneously introduces the non-stationarity problem. Updating policies sequentially, agent by agent in an appropriate order, improves policy performance but suffers from low efficiency: sequential execution lengthens both model training and execution time. Intuitively, partitioning the agents' policies according to their interdependence and updating the joint policy batch by batch can effectively balance performance and efficiency. However, determining the optimal batch partition of policies and the batch updating order is challenging. First, we propose a sequential batched policy updating scheme, B2MAPO (Batch by Batch Multi-Agent Policy Optimization), with a theoretical guarantee of a monotonically and incrementally tightened bound. Second, we design a universal, modularized, plug-and-play B2MAPO hierarchical framework that satisfies the CTDE principle and can conveniently integrate any MARL model to fully exploit and merge their merits, including policy optimality and inference efficiency. Next, we devise a DAG-based B2MAPO algorithm, a carefully designed implementation of the B2MAPO framework. Comprehensive experiments on the StarCraft II Multi-Agent Challenge and Google Research Football demonstrate that the DAG-based B2MAPO algorithm outperforms baseline methods. Meanwhile, compared with A2PO, our algorithm reduces model training and execution time by 60.4% and 78.7%, respectively.
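To illustrate the batch-by-batch idea, the sketch below groups agents into update batches by topological level of a dependency DAG: agents within a batch have no interdependence and can be updated together, while batches are processed in order. This is only a minimal sketch under assumed inputs (an agent list and dependency edges); the paper's actual partitioning criterion and update rule are not specified in the abstract.

```python
from collections import defaultdict

def topological_batches(agents, edges):
    """Partition agents into ordered update batches (Kahn-style
    level-by-level traversal of the dependency DAG).

    `edges` is a list of (u, v) pairs meaning that agent v's policy
    update depends on agent u's updated policy. Agents in the same
    batch are mutually independent, so their policies can be updated
    simultaneously; batches themselves are updated sequentially.
    (Hypothetical helper for illustration only.)
    """
    in_degree = {a: 0 for a in agents}
    successors = defaultdict(list)
    for u, v in edges:
        successors[u].append(v)
        in_degree[v] += 1

    # Start with agents that depend on no one.
    batch = [a for a in agents if in_degree[a] == 0]
    batches = []
    while batch:
        batches.append(batch)
        nxt = []
        for u in batch:
            for v in successors[u]:
                in_degree[v] -= 1
                if in_degree[v] == 0:
                    nxt.append(v)
        batch = nxt
    return batches
```

For example, with agents `[0, 1, 2, 3]` and dependencies `[(0, 2), (1, 2), (2, 3)]`, agents 0 and 1 form the first batch, then 2, then 3; a joint update would iterate over these batches in order.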