Policy gradient methods enjoy strong practical performance in numerous tasks in reinforcement learning. Their theoretical understanding in multi-agent settings, however, remains limited, especially beyond two-player competitive and potential Markov games. In this paper, we develop a new framework to characterize optimistic policy gradient methods in multi-player Markov games with a single controller. Specifically, under the further assumption that the game exhibits an equilibrium collapse, in that the marginals of coarse correlated equilibria (CCE) induce Nash equilibria (NE), we show convergence to stationary $\epsilon$-NE in $O(1/\epsilon^2)$ iterations, where $O(\cdot)$ suppresses polynomial factors in the natural parameters of the game. Such an equilibrium collapse is well known to occur in two-player zero-sum Markov games, but it also occurs in a class of multi-player Markov games with separable interactions, as established by recent work. As a result, we bypass the known complexity barriers for computing stationary NE that arise when either of our assumptions fails. Our approach relies on a natural generalization of the classical Minty property that we introduce, which we expect to have further applications beyond Markov games.
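For orientation, the classical Minty property mentioned above can be stated as follows. This is only the standard textbook condition, written in notation assumed here for illustration (a convex feasible set $\mathcal{X}$ and an operator $F : \mathcal{X} \to \mathbb{R}^d$); the generalization introduced in the paper is not reproduced.

$$
\exists\, x^\star \in \mathcal{X} \quad \text{such that} \quad \langle F(x),\, x - x^\star \rangle \ge 0 \quad \text{for all } x \in \mathcal{X}.
$$

In a game where each player minimizes a cost, $F$ is typically taken to be the concatenation of the players' individual cost gradients, so the condition says that some fixed point $x^\star$ is never worse than the current iterate to first order, whatever that iterate is.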