We study two-player zero-sum stochastic games, and propose a form of independent learning dynamics called Doubly Smoothed Best-Response dynamics, which integrates a discrete and doubly smoothed variant of the best-response dynamics into temporal-difference (TD)-learning and minimax value iteration. The resulting dynamics are payoff-based, convergent, rational, and symmetric among players. Our main results provide finite-sample guarantees. In particular, we prove the first-known $\tilde{\mathcal{O}}(1/\epsilon^2)$ sample complexity bound for payoff-based independent learning dynamics, up to a smoothing bias. In the special case where the stochastic game has only one state (i.e., matrix games), we provide a sharper $\tilde{\mathcal{O}}(1/\epsilon)$ sample complexity. Our analysis uses a novel coupled Lyapunov drift approach to capture the evolution of multiple sets of coupled and stochastic iterates, which might be of independent interest.