We consider two-player zero-sum stochastic games and propose a two-timescale $Q$-learning algorithm with function approximation that is payoff-based, convergent, rational, and symmetric between the two players. In two-timescale $Q$-learning, the fast-timescale iterates are updated in the spirit of stochastic gradient descent, and the slow-timescale iterates (which we use to compute the policies) are updated by taking a convex combination of their previous iterate and the latest fast-timescale iterate. Introducing the slow timescale, together with its update equation, is our main algorithmic novelty. In the special case of linear function approximation, we establish, to the best of our knowledge, the first last-iterate finite-sample bound for payoff-based independent learning dynamics of this type. The result implies a polynomial sample complexity for finding a Nash equilibrium in such stochastic games. To establish the results, we model our proposed algorithm as a two-timescale stochastic approximation and derive the finite-sample bound through a Lyapunov-based approach. The key novelty lies in constructing a valid Lyapunov function to capture the evolution of the slow-timescale iterates. Specifically, through a change of variable, we show that the update equation of the slow-timescale iterates resembles the classical smoothed best-response dynamics, for which the regularized Nash gap serves as a valid Lyapunov function. This insight enables us to construct a valid Lyapunov function via a generalized variant of the Moreau envelope of the regularized Nash gap. The construction of our Lyapunov function might be of broad independent interest in studying the behavior of stochastic approximation algorithms.
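To make the two-timescale structure described in the abstract concrete, the following is a minimal sketch on a single-state matrix game with a tabular representation. It is not the algorithm analyzed in the paper: the function names (`softmax`, `two_timescale_sketch`), the step sizes `alpha` and `beta`, the temperature `tau`, and the use of softmax policies are all assumptions chosen for illustration. It shows only the two ingredients named above: a fast, stochastic-gradient-like update driven by realized payoffs, and a slow update that takes a convex combination of the previous slow iterate and the latest fast iterate.

```python
import numpy as np


def softmax(x, tau):
    """Softmax with temperature tau (an assumed smoothing choice)."""
    z = (x - x.max()) / tau
    e = np.exp(z)
    return e / e.sum()


def two_timescale_sketch(payoff, steps=20000, alpha=0.05, beta=0.005,
                         tau=0.5, seed=0):
    """Toy two-timescale loop on a zero-sum matrix game (hypothetical).

    payoff[a1, a2] is player 1's reward; player 2 receives its negative.
    Each player sees only its own action and realized payoff.
    """
    rng = np.random.default_rng(seed)
    n1, n2 = payoff.shape
    q_fast = [np.zeros(n1), np.zeros(n2)]  # fast-timescale iterates
    q_slow = [np.zeros(n1), np.zeros(n2)]  # slow-timescale iterates

    for _ in range(steps):
        # Policies are computed from the slow iterates only.
        pi1 = softmax(q_slow[0], tau)
        pi2 = softmax(q_slow[1], tau)
        a1 = rng.choice(n1, p=pi1)
        a2 = rng.choice(n2, p=pi2)
        r = payoff[a1, a2]

        # Fast timescale: SGD-like step on the squared error between the
        # estimate for the played action and the realized payoff.
        q_fast[0][a1] += alpha * (r - q_fast[0][a1])
        q_fast[1][a2] += alpha * (-r - q_fast[1][a2])

        # Slow timescale: convex combination of the previous slow iterate
        # and the latest fast iterate (beta is much smaller than alpha).
        for i in range(2):
            q_slow[i] = (1 - beta) * q_slow[i] + beta * q_fast[i]

    return softmax(q_slow[0], tau), softmax(q_slow[1], tau)


if __name__ == "__main__":
    matching_pennies = np.array([[1.0, -1.0], [-1.0, 1.0]])
    pi1, pi2 = two_timescale_sketch(matching_pennies)
    print("player 1 policy:", pi1)
    print("player 2 policy:", pi2)
```

Choosing `beta` much smaller than `alpha` is what separates the two timescales in this sketch: the policies, which depend only on the slow iterates, change slowly relative to the payoff estimates. The sketch makes no claim about convergence rates; those are the subject of the paper's analysis.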