We study the problem of learning in zero-sum matrix games with repeated play and bandit feedback. Specifically, we focus on developing uncoupled algorithms that guarantee, without communication between players, the convergence of the last iterate to a Nash equilibrium. Although the non-bandit case has been studied extensively, this setting has only been explored recently, with a bound of $\mathcal{O}(T^{-1/8})$ on the exploitability gap. We show that, for uncoupled algorithms, guaranteeing convergence of the policy profiles to a Nash equilibrium is detrimental to performance: the best attainable rate is $\Omega(T^{-1/4})$, in contrast to the usual $\Omega(T^{-1/2})$ rate for convergence of the average iterates. We then propose two algorithms that achieve this optimal rate up to constant and logarithmic factors. The first algorithm leverages a straightforward trade-off between exploration and exploitation, while the second employs a regularization technique based on a two-step mirror descent approach.
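For readers unfamiliar with the exploitability gap used to state these rates, the following is a minimal sketch of how it can be computed for a given policy profile. It assumes the common convention that the row player maximizes and the column player minimizes the payoff $A_{ij}$; the function name `exploitability_gap` and the matching-pennies example are illustrative choices, not taken from the paper, whose exact definition may differ by a constant factor.

```python
def exploitability_gap(A, x, y):
    """Exploitability (duality) gap of the profile (x, y) in the zero-sum game A.

    Assumed convention (not taken from the paper): the row player draws
    i ~ x and maximizes A[i][j]; the column player draws j ~ y and minimizes it.
    """
    n_rows, n_cols = len(A), len(A[0])
    # Expected payoff of each pure row action against the column policy y.
    row_values = [sum(A[i][j] * y[j] for j in range(n_cols)) for i in range(n_rows)]
    # Expected payoff of each pure column action against the row policy x.
    col_values = [sum(x[i] * A[i][j] for i in range(n_rows)) for j in range(n_cols)]
    # Maximizer's best-response value minus minimizer's best-response value.
    return max(row_values) - min(col_values)


# Matching pennies: the unique Nash equilibrium is uniform play by both players.
MATCHING_PENNIES = [[1.0, -1.0], [-1.0, 1.0]]
gap_at_nash = exploitability_gap(MATCHING_PENNIES, [0.5, 0.5], [0.5, 0.5])
gap_at_pure = exploitability_gap(MATCHING_PENNIES, [1.0, 0.0], [1.0, 0.0])
```

Under this convention the gap is nonnegative and equals zero exactly at a Nash equilibrium, so a last-iterate guarantee means the gap of the profile played at round $T$ itself (rather than of the time-averaged profile) tends to zero.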