Stochastic Approximation (SA) is a widely used algorithmic approach in various fields, including optimization and reinforcement learning (RL). Among RL algorithms, Q-learning is particularly popular due to its empirical success. In this paper, we study asynchronous Q-learning with constant stepsize, which is commonly used in practice for its fast convergence. By connecting constant-stepsize Q-learning to a time-homogeneous Markov chain, we show that the iterates converge in distribution in Wasserstein distance and establish an exponential convergence rate. We also establish a Central Limit Theorem for the Q-learning iterates, demonstrating the asymptotic normality of the averaged iterates. Moreover, we provide an explicit expansion of the asymptotic bias of the averaged iterate in the stepsize. Specifically, the bias is proportional to the stepsize up to higher-order terms, and we provide an explicit expression for the linear coefficient. This precise characterization of the bias allows the Richardson-Romberg (RR) extrapolation technique to be applied to construct a new estimate that is provably closer to the optimal Q function. Numerical results corroborate our theoretical findings on the improvement offered by the RR extrapolation method.
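The bias statement above suggests a simple recipe, sketched here under stated assumptions rather than as the paper's own implementation. If the averaged iterate at stepsize α has expectation Q* + αB + O(α²), then combining runs at stepsizes α and 2α as 2·Q̄(α) − Q̄(2α) cancels the linear term. The small random MDP, the uniformly random behavior policy, the burn-in length, and all function names below are hypothetical choices made for illustration only.

```python
import numpy as np


def random_mdp(n_states, n_actions, rng):
    """Draw a small random MDP: kernel P[s, a, s'] and rewards R[s, a]."""
    P = rng.random((n_states, n_actions, n_states))
    P /= P.sum(axis=2, keepdims=True)  # each P[s, a, :] becomes a distribution
    R = rng.random((n_states, n_actions))
    return P, R


def optimal_q(P, R, gamma, iters=2000):
    """Approximate the optimal Q function by value iteration (reference only)."""
    Q = np.zeros_like(R)
    for _ in range(iters):
        Q = R + gamma * P @ Q.max(axis=1)
    return Q


def averaged_q_learning(P, R, gamma, alpha, n_steps, burn_in, rng):
    """Asynchronous Q-learning with constant stepsize alpha on one trajectory.

    Assumes a uniformly random behavior policy. Returns the average of the
    iterates produced after the first burn_in steps.
    """
    n_states, n_actions = R.shape
    Q = np.zeros((n_states, n_actions))
    Q_sum = np.zeros_like(Q)
    s = 0
    for k in range(n_steps):
        a = rng.integers(n_actions)
        s_next = rng.choice(n_states, p=P[s, a])
        # Asynchronous update: only the visited (s, a) entry changes.
        td_target = R[s, a] + gamma * Q[s_next].max()
        Q[s, a] += alpha * (td_target - Q[s, a])
        if k >= burn_in:
            Q_sum += Q
        s = s_next
    return Q_sum / (n_steps - burn_in)


def rr_extrapolate(q_alpha, q_2alpha):
    """Combine averaged iterates at stepsizes alpha and 2*alpha so that the
    bias term linear in the stepsize cancels."""
    return 2.0 * q_alpha - q_2alpha


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    gamma, alpha = 0.9, 0.1
    P, R = random_mdp(4, 2, rng)
    q_star = optimal_q(P, R, gamma)
    q_a = averaged_q_learning(P, R, gamma, alpha, 50_000, 5_000, rng)
    q_2a = averaged_q_learning(P, R, gamma, 2 * alpha, 50_000, 5_000, rng)
    q_rr = rr_extrapolate(q_a, q_2a)
    print("error, stepsize alpha   :", np.abs(q_a - q_star).max())
    print("error, stepsize 2*alpha :", np.abs(q_2a - q_star).max())
    print("error, RR extrapolation :", np.abs(q_rr - q_star).max())
```

On any single finite run the comparison is noisy, since extrapolation removes the first-order bias but also combines the variance of the two runs; the improvement is expected to show most clearly with long trajectories or after averaging over several independent runs.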