We consider decentralized learning in zero-sum games, where each player observes only its own payoff information and is agnostic to the opponent's actions and payoffs. Previous work established convergence to a Nash equilibrium in this setting using two-time-scale algorithms under strong reachability assumptions. We address the open problem of efficiently reaching an approximate Nash equilibrium with an uncoupled, single-time-scale algorithm under weaker conditions. Our contribution is a rational and convergent algorithm that uses Tsallis-entropy regularization within a value-iteration-based approach. The algorithm learns an approximate Nash equilibrium in polynomial time and requires only the existence of a policy pair that induces an irreducible and aperiodic Markov chain, which considerably weakens past assumptions. Our analysis leverages negative drift inequalities and introduces novel properties of Tsallis entropy that are of independent interest.
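For reference, the standard definition of the Tsallis entropy of a distribution over a finite action set is recalled below. The symbols $\pi$ (a policy at a fixed state), $\mathcal{A}$ (the action set), and $q$ (the entropy parameter) are notation assumed here for illustration; the abstract does not state which parameter value or scaling the algorithm uses.

```latex
H_q(\pi) = \frac{1}{q-1}\Bigl(1 - \sum_{a \in \mathcal{A}} \pi(a)^{q}\Bigr),
\qquad q > 0,\ q \neq 1,
\qquad
\lim_{q \to 1} H_q(\pi) = -\sum_{a \in \mathcal{A}} \pi(a)\log \pi(a).
```

In the limit $q \to 1$ this recovers the Shannon entropy, so Tsallis-entropy regularization can be read as a one-parameter generalization of the more common Shannon-entropy regularizer.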