Although multi-agent trust region algorithms have achieved great empirical success in solving coordination tasks, most of them suffer from a non-stationarity problem because agents update their policies simultaneously. In contrast, a sequential scheme that updates policies agent by agent offers another perspective and shows strong performance. However, sample inefficiency and the lack of monotonic improvement guarantees for each agent remain two significant challenges for the sequential scheme. In this paper, we propose the \textbf{A}gent-by-\textbf{a}gent \textbf{P}olicy \textbf{O}ptimization (A2PO) algorithm to improve sample efficiency while retaining the guarantee of monotonic improvement for each agent during training. We justify the tightness of the monotonic improvement bound compared with other trust region algorithms. From the perspective of sequentially updating agents, we further consider the effect of the agent updating order and extend the theory of non-stationarity to the sequential update scheme. To evaluate A2PO, we conduct a comprehensive empirical study on four benchmarks: StarCraftII, Multi-agent MuJoCo, Multi-agent Particle Environment, and Google Research Football full-game scenarios. A2PO consistently outperforms strong baselines.