Policy gradient methods enjoy strong practical performance in numerous reinforcement learning tasks. Their theoretical understanding in multiagent settings, however, remains limited, especially beyond two-player competitive and potential Markov games. In this paper, we develop a new framework to characterize optimistic policy gradient methods in multi-player Markov games with a single controller. Specifically, under the further assumption that the game exhibits an equilibrium collapse, in that the marginals of coarse correlated equilibria (CCE) induce Nash equilibria (NE), we show convergence to stationary $\epsilon$-NE in $O(1/\epsilon^2)$ iterations, where $O(\cdot)$ suppresses polynomial factors in the natural parameters of the game. Such an equilibrium collapse is well known to manifest itself in two-player zero-sum Markov games, but it also occurs in a class of multi-player Markov games with separable interactions, as established by recent work. As a result, we bypass known complexity barriers for computing stationary NE that apply when either of our assumptions fails. Our approach relies on a natural generalization of the classical Minty property that we introduce, which we anticipate will have further applications beyond Markov games.