In this paper, we derive rates of convergence in the high-dimensional central limit theorem for Polyak-Ruppert averaged iterates generated by the asynchronous Q-learning algorithm with a polynomial stepsize $k^{-\omega},\, \omega \in (1/2, 1]$. Assuming that the sequence of state-action-next-state triples $(s_k, a_k, s_{k+1})_{k \geq 0}$ forms a uniformly geometrically ergodic Markov chain, we establish a rate of order up to $n^{-1/6} \log^{4} (nSA)$ over the class of hyper-rectangles, where $n$ is the number of samples used by the algorithm and $S$ and $A$ denote the numbers of states and actions, respectively. To obtain this result, we prove a high-dimensional central limit theorem for sums of martingale differences, which may be of independent interest. Finally, we present high-order moment bounds for the algorithm's last iterate.
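The procedure analyzed above can be sketched in code. The following is a minimal, hedged illustration of asynchronous Q-learning with the polynomial stepsize $k^{-\omega}$ and Polyak-Ruppert averaging on a toy randomly generated MDP; the MDP, the uniform behavior policy, and all numeric choices (`S`, `A`, `gamma`, `omega`, `n`) are illustrative assumptions, not the paper's exact setting.

```python
import numpy as np

# Illustrative sketch (not the paper's setup): asynchronous Q-learning with
# polynomial stepsize k^{-omega}, omega in (1/2, 1], plus Polyak-Ruppert
# averaging, run on a small randomly generated MDP.
rng = np.random.default_rng(0)
S, A, gamma, omega = 3, 2, 0.9, 0.7   # assumed toy sizes and parameters

P = rng.dirichlet(np.ones(S), size=(S, A))  # P[s, a] = distribution over next states
R = rng.uniform(size=(S, A))                # deterministic reward table (assumption)

Q = np.zeros((S, A))
Q_bar = np.zeros((S, A))   # Polyak-Ruppert running average of the iterates
s = 0
n = 20_000
for k in range(1, n + 1):
    a = rng.integers(A)                     # uniform exploration (behavior policy)
    s_next = rng.choice(S, p=P[s, a])
    eta = k ** (-omega)                     # polynomial stepsize k^{-omega}
    td = R[s, a] + gamma * Q[s_next].max() - Q[s, a]
    Q[s, a] += eta * td                     # asynchronous: one (s, a) entry per step
    Q_bar += (Q - Q_bar) / k                # running average Q_bar_k = (1/k) sum Q_j
    s = s_next

# Fixed point of the Bellman optimality operator, for comparison
Q_star = np.zeros((S, A))
for _ in range(1_000):
    Q_star = R + gamma * P @ Q_star.max(axis=1)
err = np.abs(Q_bar - Q_star).max()
```

The averaged iterate `Q_bar` is the object whose fluctuations around $Q^\star$ the high-dimensional CLT describes, with the sup-norm error corresponding to the hyper-rectangle class in the abstract.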