Although multi-agent trust region algorithms have achieved great empirical success on coordination tasks, most of them suffer from non-stationarity because agents update their policies simultaneously. In contrast, a sequential scheme that updates policies agent-by-agent offers an alternative and shows strong performance. However, sample inefficiency and the lack of monotonic improvement guarantees for each agent remain two significant challenges for the sequential scheme. In this paper, we propose the \textbf{A}gent-by-\textbf{a}gent \textbf{P}olicy \textbf{O}ptimization (A2PO) algorithm to improve sample efficiency while retaining the monotonic improvement guarantee for each agent during training. We justify the tightness of its monotonic improvement bound compared with those of other trust region algorithms. From the perspective of sequentially updating agents, we further consider the effect of the agent updating order and extend the theory of non-stationarity to the sequential update scheme. To evaluate A2PO, we conduct a comprehensive empirical study on four benchmarks: StarCraftII, Multi-agent MuJoCo, Multi-agent Particle Environment, and Google Research Football full game scenarios. A2PO consistently outperforms strong baselines.