We study decentralized learning in two-player zero-sum discounted Markov games, where the goal is to design a policy optimization algorithm for either agent that satisfies two properties. First, a player does not need to know the opponent's policy in order to update its own. Second, when both players adopt the algorithm, their joint policy converges to a Nash equilibrium of the game. To this end, we construct a meta-algorithm, dubbed $\texttt{Homotopy-PO}$, which provably finds a Nash equilibrium at a global linear rate. In particular, $\texttt{Homotopy-PO}$ interweaves two base algorithms, $\texttt{Local-Fast}$ and $\texttt{Global-Slow}$, via homotopy continuation. $\texttt{Local-Fast}$ enjoys local linear convergence, while $\texttt{Global-Slow}$ converges globally but at a slower, sublinear rate. Within this scheme, $\texttt{Global-Slow}$ essentially serves as a ``guide'' that identifies a benign neighborhood in which $\texttt{Local-Fast}$ converges quickly. Since the exact size of such a neighborhood is unknown, we apply a doubling trick to switch between the two base algorithms. The switching scheme is carefully designed so that the aggregate performance of the algorithm is driven by $\texttt{Local-Fast}$. Furthermore, we prove that $\texttt{Local-Fast}$ and $\texttt{Global-Slow}$ can both be instantiated by variants of the optimistic gradient descent/ascent (OGDA) method, which is of independent interest.
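To give a feel for the optimistic update that the base algorithms build on, the following is a minimal sketch of plain OGDA on the scalar bilinear toy problem $f(x, y) = xy$, where $x$ minimizes and $y$ maximizes. It is not the variant analyzed for $\texttt{Local-Fast}$ or $\texttt{Global-Slow}$: the function name `ogda_bilinear`, the unconstrained scalar setting, the step size, and the initialization of the previous gradient are all assumptions made for illustration, and there are no states, policies, or discounting here.

```python
from typing import List, Tuple


def ogda_bilinear(x0: float, y0: float, eta: float, steps: int) -> List[Tuple[float, float]]:
    """Run optimistic gradient descent/ascent on f(x, y) = x * y.

    Illustrative toy only (hypothetical helper, not from the source):
    x is the minimizing player, y is the maximizing player, and the
    unique saddle point is (0, 0).
    """
    x, y = x0, y0
    # Assumption: the "previous" gradients start equal to the current ones,
    # so the first step reduces to an ordinary gradient descent/ascent step.
    gx_prev, gy_prev = y, x
    trajectory = [(x, y)]
    for _ in range(steps):
        gx, gy = y, x  # df/dx = y, df/dy = x
        # Optimistic step: use 2 * current gradient minus previous gradient.
        x_next = x - eta * (2.0 * gx - gx_prev)
        y_next = y + eta * (2.0 * gy - gy_prev)
        gx_prev, gy_prev = gx, gy
        x, y = x_next, y_next
        trajectory.append((x, y))
    return trajectory


if __name__ == "__main__":
    traj = ogda_bilinear(1.0, 1.0, eta=0.1, steps=2000)
    x_last, y_last = traj[-1]
    print("distance to saddle point:", (x_last ** 2 + y_last ** 2) ** 0.5)
```

Each player needs only its own gradient, which in this toy happens to equal the opponent's current iterate; that is the sense in which the update is decentralized. On this problem the iterates contract toward the saddle point at a geometric rate for a small step size, whereas simultaneous gradient descent/ascent without the correction term spirals outward.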