We prove that optimistic follow-the-regularized-leader (OFTRL), together with smooth value updates, finds an $O(T^{-1})$-approximate Nash equilibrium in $T$ iterations for two-player zero-sum Markov games with full information. This improves on the $\tilde{O}(T^{-5/6})$ convergence rate recently shown by Zhang et al. (2022). The refined analysis hinges on two essential ingredients. First, the sum of the regrets of the two players, though not necessarily non-negative as it is in normal-form games, is approximately non-negative in Markov games. This property allows us to bound the second-order path lengths of the learning dynamics. Second, we prove a tighter algebraic inequality regarding the weights deployed by OFTRL that shaves off an extra $\log T$ factor. This crucial improvement enables the inductive analysis that leads to the final $O(T^{-1})$ rate.
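For reference, the non-negativity of the regret sum in the normal-form case follows from a short standard calculation, sketched here under assumed notation that does not appear in the abstract: a payoff matrix $A$, a minimizing player with strategies $x_t$, a maximizing player with strategies $y_t$, and time averages $\bar{x} = \frac{1}{T}\sum_{t=1}^{T} x_t$ and $\bar{y} = \frac{1}{T}\sum_{t=1}^{T} y_t$.

$$
\begin{aligned}
\mathrm{Reg}_x + \mathrm{Reg}_y
&= \Big(\sum_{t=1}^{T} x_t^\top A y_t - \min_{x} \sum_{t=1}^{T} x^\top A y_t\Big)
 + \Big(\max_{y} \sum_{t=1}^{T} x_t^\top A y - \sum_{t=1}^{T} x_t^\top A y_t\Big) \\
&= T \Big(\max_{y} \bar{x}^\top A y - \min_{x} x^\top A \bar{y}\Big) \;\ge\; 0 ,
\end{aligned}
$$

where the last inequality holds because $\max_{y} \bar{x}^\top A y \ge \bar{x}^\top A \bar{y} \ge \min_{x} x^\top A \bar{y}$. In other words, the regret sum equals $T$ times the duality gap of the average strategies. In Markov games the per-state payoffs change across iterations as the value estimates are updated, so this identity is not directly available, which is consistent with the abstract's claim that only approximate non-negativity holds there.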