Policy optimization methods with function approximation are widely used in multi-agent reinforcement learning. However, how to design such algorithms with statistical guarantees remains unclear. Leveraging a multi-agent performance difference lemma that characterizes the landscape of multi-agent policy optimization, we find that the localized action value function serves as an ideal descent direction for each local policy. Motivated by this observation, we present a multi-agent PPO algorithm in which the local policy of each agent is updated similarly to vanilla PPO. We prove that, under standard regularity conditions on the Markov game and problem-dependent quantities, our algorithm converges to the globally optimal policy at a sublinear rate. We extend our algorithm to the off-policy setting and introduce pessimism to policy evaluation, a choice that aligns with experimental results. To our knowledge, this is the first provably convergent multi-agent PPO algorithm for cooperative Markov games.
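To make the per-agent update concrete, the following is a minimal tabular sketch, not the algorithm analyzed in the paper. It assumes the KL-penalized form of PPO, whose closed-form solution in the tabular case is an exponentiated update, pi_new(a|s) proportional to pi_old(a|s) * exp(beta * Q(s, a)). The function name `local_policy_update`, the step size `beta`, and the toy value estimates are all hypothetical, and the localized action value is treated as a given array rather than being learned with function approximation.

```python
import numpy as np

def local_policy_update(pi_old, q_local, beta):
    """One KL-regularized improvement step for a single agent (tabular sketch).

    pi_old:  (n_states, n_actions) current local policy; rows sum to 1 and
             entries are assumed strictly positive.
    q_local: (n_states, n_actions) estimate of the localized action value
             (a placeholder here, assumed to be supplied by policy evaluation).
    beta:    hypothetical step size; larger values give a greedier update.
    """
    logits = np.log(pi_old) + beta * q_local
    logits -= logits.max(axis=1, keepdims=True)  # shift for numerical stability
    pi_new = np.exp(logits)
    return pi_new / pi_new.sum(axis=1, keepdims=True)

# Toy setup: 2 agents, 1 state, 2 actions each, uniform initial policies.
policies = [np.full((1, 2), 0.5), np.full((1, 2), 0.5)]
q_estimates = [np.array([[1.0, 0.0]]), np.array([[0.0, 2.0]])]

# Each agent updates its own local policy from its own value estimate.
updated = [
    local_policy_update(pi, q, beta=1.0)
    for pi, q in zip(policies, q_estimates)
]
```

In this sketch each agent shifts probability toward the actions its localized value estimate favors, while the `log(pi_old)` term keeps the new policy close to the old one. This is the sense in which the local update resembles a single-agent PPO step.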