Multi-agent interactions are increasingly important in reinforcement learning, and the theoretical foundations of policy gradient methods have attracted growing research interest. We investigate the global convergence of natural policy gradient (NPG) algorithms in multi-agent learning. We first show that vanilla NPG may fail to converge in parameters, i.e., the vector that parameterizes the policy may not converge, even when the costs are regularized (regularization is what enabled strong convergence guarantees in the policy space in the literature). This non-convergence of parameters leads to stability issues in learning, which become especially relevant in the function approximation setting, where we can only operate on low-dimensional parameters rather than the high-dimensional policy. We then propose variants of the NPG algorithm with global last-iterate parameter convergence guarantees for several standard multi-agent learning scenarios: two-player zero-sum matrix and Markov games, and multi-player monotone games. We also generalize the results to certain function approximation settings. Note that in our algorithms, the agents take symmetric roles. Our results might also be of independent interest for solving nonconvex-nonconcave minimax optimization problems with certain structures. Simulations are provided to corroborate our theoretical findings.