No-regret learning has long been closely connected to game theory. Recent work has devised uncoupled no-regret learning dynamics that, when adopted by all players in a normal-form game, converge to various equilibrium solutions at a near-optimal rate of $\widetilde{O}(T^{-1})$, a significant improvement over the $O(1/\sqrt{T})$ rate of classic no-regret learners. However, analogous convergence results are scarce in Markov games, a more general setting that lays the foundation for multi-agent reinforcement learning. In this work, we close this gap by showing that the optimistic-follow-the-regularized-leader (OFTRL) algorithm, together with appropriate value update procedures, can find $\widetilde{O}(T^{-1})$-approximate (coarse) correlated equilibria in full-information general-sum Markov games within $T$ iterations. Numerical results are included to corroborate our theoretical findings.
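To make the normal-form building block concrete, the following is a minimal sketch of OFTRL with a negative-entropy regularizer (often called optimistic Hedge) run by both players of a small two-player zero-sum matrix game. It is not the Markov-game procedure of this work: the value update step is omitted, and the payoff matrix `PAYOFF`, the step size `eta`, the iteration count, and all function names are illustrative assumptions. Each player uses its most recent utility vector as the prediction of the next one.

```python
import math

# Hypothetical 2x2 zero-sum game: entry [i][j] is the row player's payoff.
# Its unique equilibrium has both players choosing their first action w.p. 0.4.
PAYOFF = [[2.0, -1.0],
          [-1.0, 1.0]]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def optimistic_hedge(payoff, eta=0.1, iterations=2000):
    """Run entropy-regularized OFTRL for both players; return average strategies."""
    n_rows, n_cols = len(payoff), len(payoff[0])
    cum_row, cum_col = [0.0] * n_rows, [0.0] * n_cols    # cumulative utilities
    last_row, last_col = [0.0] * n_rows, [0.0] * n_cols  # predictions (last utilities)
    avg_x, avg_y = [0.0] * n_rows, [0.0] * n_cols

    for _ in range(iterations):
        # OFTRL step: maximize <x, cumulative + prediction> / plus entropy,
        # which has the closed form softmax(eta * (cumulative + prediction)).
        x = softmax([eta * (c + p) for c, p in zip(cum_row, last_row)])
        y = softmax([eta * (c + p) for c, p in zip(cum_col, last_col)])

        # Full-information feedback: each player observes its utility vector.
        last_row = [sum(payoff[i][j] * y[j] for j in range(n_cols))
                    for i in range(n_rows)]
        last_col = [-sum(x[i] * payoff[i][j] for i in range(n_rows))
                    for j in range(n_cols)]

        cum_row = [c + u for c, u in zip(cum_row, last_row)]
        cum_col = [c + u for c, u in zip(cum_col, last_col)]
        avg_x = [a + xi for a, xi in zip(avg_x, x)]
        avg_y = [a + yj for a, yj in zip(avg_y, y)]

    return ([a / iterations for a in avg_x],
            [a / iterations for a in avg_y])


def duality_gap(payoff, x, y):
    """Sum of both players' best-response gains against the profile (x, y)."""
    n_rows, n_cols = len(payoff), len(payoff[0])
    best_row = max(sum(payoff[i][j] * y[j] for j in range(n_cols))
                   for i in range(n_rows))
    best_col = min(sum(x[i] * payoff[i][j] for i in range(n_rows))
                   for j in range(n_cols))
    return best_row - best_col


if __name__ == "__main__":
    x_bar, y_bar = optimistic_hedge(PAYOFF)
    print("average row strategy:", x_bar)
    print("average column strategy:", y_bar)
    print("duality gap:", duality_gap(PAYOFF, x_bar, y_bar))
```

In a zero-sum game the duality gap of the average strategies is bounded by the sum of the two players' regrets divided by $T$, so tracking it as the iteration count grows gives a rough empirical view of the convergence rate in this simplified setting.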