We study best-response-type learning dynamics for two-player zero-sum matrix games. We consider two settings, distinguished by the type of information each player has about the game and about the opponent's strategy. The first is the full-information setting, in which each player knows both their own and the opponent's payoff matrices and observes the opponent's mixed strategy. The second is the minimal-information setting, in which players neither observe the opponent's strategy nor know either payoff matrix; instead, they observe only their own realized payoffs. For this setting, also known as the radically uncoupled case in the learning-in-games literature, we study a two-timescale learning dynamics that combines smoothed best-response-type updates for strategy estimates with a TD-learning update for estimating a local payoff function. For these dynamics, without additional exploration, we provide polynomial-time finite-sample guarantees for convergence to an $\epsilon$-Nash equilibrium.
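As an informal illustration of the full-information setting, the following sketch runs simultaneous smoothed (logit) best-response updates on rock-paper-scissors, a standard $3\times 3$ zero-sum game. The step size, temperature, and game are illustrative choices and not the paper's exact parameters or guarantees.

```python
import numpy as np

def smoothed_br(u, tau):
    """Logit (softmax) smoothed best response to a payoff vector u.

    Hypothetical helper for illustration; tau is the smoothing temperature.
    """
    z = np.exp((u - u.max()) / tau)  # subtract max for numerical stability
    return z / z.sum()

def run_dynamics(A, x, y, alpha=0.05, tau=0.5, steps=2000):
    """Simultaneous smoothed best-response updates for both players.

    Row player maximizes x^T A y, so best-responds to A y;
    column player minimizes it, so best-responds to -A^T x.
    Each strategy moves a step of size alpha toward its smoothed response.
    """
    for _ in range(steps):
        bx = smoothed_br(A @ y, tau)
        by = smoothed_br(-A.T @ x, tau)
        x = (1 - alpha) * x + alpha * bx
        y = (1 - alpha) * y + alpha * by
    return x, y

# Rock-paper-scissors: skew-symmetric payoff matrix, unique Nash = uniform.
A = np.array([[0., -1., 1.],
              [1., 0., -1.],
              [-1., 1., 0.]])
x0 = np.array([0.6, 0.3, 0.1])       # off-equilibrium start for the row player
y0 = np.array([1/3, 1/3, 1/3])
x, y = run_dynamics(A, x0, y0)
```

With these parameters both strategies contract toward the uniform Nash equilibrium, so after a few thousand steps `x` and `y` are close to $(1/3, 1/3, 1/3)$. The minimal-information dynamics studied in the paper replace the exact payoff vectors `A @ y` and `-A.T @ x` with TD-learning estimates built from realized payoffs, which this full-information sketch does not model.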