We revisit the problem of learning in two-player zero-sum Markov games, focusing on developing an algorithm that is uncoupled, convergent, and rational, with non-asymptotic convergence rates. As a warm-up, we start with stateless matrix games under bandit feedback and show an $O(t^{-\frac{1}{8}})$ last-iterate convergence rate. To the best of our knowledge, this is the first result that obtains a finite-time last-iterate convergence rate given access to only bandit feedback. We then extend the result to irreducible Markov games, providing a last-iterate convergence rate of $O(t^{-\frac{1}{9+\varepsilon}})$ for any $\varepsilon>0$. Finally, we study Markov games without any assumptions on the dynamics and show a path convergence rate of $O(t^{-\frac{1}{10}})$, where path convergence is a new notion of convergence that we define. Our algorithm removes the coordination and prior-knowledge requirements of [Wei et al., 2021], which pursued the same goals for irreducible Markov games. Our algorithm is related to those of [Chen et al., 2021, Cen et al., 2021] and likewise builds on the entropy regularization technique. However, we remove their requirement that the players communicate entropy values, making our algorithm entirely uncoupled.