We revisit the problem of learning in two-player zero-sum Markov games, focusing on developing an algorithm that is uncoupled, convergent, and rational, with non-asymptotic convergence rates. As a warm-up, we start with stateless matrix games under bandit feedback and show an $\mathcal{O}(t^{-\frac{1}{8}})$ last-iterate convergence rate. To the best of our knowledge, this is the first result to obtain a finite-time last-iterate convergence rate given access to only bandit feedback. We then extend this result to irreducible Markov games, providing a last-iterate convergence rate of $\mathcal{O}(t^{-\frac{1}{9+\varepsilon}})$ for any $\varepsilon>0$. Finally, we study Markov games without any assumptions on the dynamics and show a path convergence rate of $\mathcal{O}(t^{-\frac{1}{10}})$, where path convergence is a new notion of convergence that we define. Our algorithm removes the synchronization and prior-knowledge requirements of [Wei et al., 2021], which pursued the same goals for irreducible Markov games. Our algorithm is related to [Chen et al., 2021; Cen et al., 2021] and likewise builds on the entropy regularization technique. However, we remove their requirement that the players communicate entropy values, making our algorithm entirely uncoupled.