Two-time-scale Stochastic Approximation (SA) is an iterative algorithm with applications in reinforcement learning and optimization. Prior finite-time analyses of such algorithms have focused on fixed-point iterations whose mappings are contractive under the Euclidean norm. Motivated by applications in reinforcement learning, we give the first mean-square bound for nonlinear two-time-scale SA whose iterations involve mappings that are contractive under arbitrary norms and are driven by Markovian noise. We show that the mean-square error decays at a rate of $O(1/n^{2/3})$ in the general case, and at a rate of $O(1/n)$ in the special case where the slower time scale is noiseless. Our analysis uses the generalized Moreau envelope to handle the arbitrary-norm contractions and solutions of the Poisson equation to deal with the Markovian noise. By analyzing the SSP Q-Learning algorithm, we give the first $O(1/n)$ bound for asynchronous control of MDPs under the average-reward criterion. We also obtain a rate of $O(1/n)$ for Q-Learning with Polyak averaging, and provide an algorithm for learning Generalized Nash Equilibria (GNE) in strongly monotone games that converges at a rate of $O(1/n^{2/3})$.
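For concreteness, a generic recursion of the type analyzed here has the form below; the notation ($F$, $G$, the chain $(S_n)$, and the step sizes) is illustrative of the standard two-time-scale template rather than the exact operators of any one application in the paper:
$$
x_{n+1} = x_n + \alpha_n\bigl(F(x_n, y_n, S_n) - x_n\bigr), \qquad
y_{n+1} = y_n + \beta_n\bigl(G(x_n, y_n, S_n) - y_n\bigr),
$$
where $(S_n)$ is an underlying Markov chain, $F(\cdot, y, s)$ is contractive under some arbitrary norm, and the step sizes satisfy $\beta_n/\alpha_n \to 0$, so that $y_n$ evolves on the slower time scale.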