From Set Convergence to Pointwise Convergence: Finite-Time Guarantees for Average-Reward Q-Learning with Adaptive Stepsizes

This work presents the first finite-time analysis of the last-iterate convergence of average-reward $Q$-learning with an asynchronous implementation. A key feature of the algorithm we study is the use of adaptive stepsizes, which serve as local clocks for each state-action pair. We show that, under appropriate assumptions, the iterates generated by this $Q$-learning algorithm converge at a rate of $\tilde{\mathcal{O}}(1/k)$ (in the mean-square sense) to the optimal $Q$-function in the span seminorm. Moreover, by adding a centering step to the algorithm, we further establish pointwise mean-square convergence to the centered optimal $Q$-function, also at a rate of $\tilde{\mathcal{O}}(1/k)$. We also show that adaptive stepsizes are necessary: without them, the algorithm fails to converge to the correct target. In addition, adaptive stepsizes can be interpreted as a form of implicit importance sampling that counteracts the effects of asynchronous updates. Technically, the use of adaptive stepsizes makes each $Q$-learning update depend on the entire sample history, which introduces strong correlations and makes the algorithm a non-Markovian stochastic approximation (SA) scheme. Our approach to overcoming this challenge involves (1) a time-inhomogeneous Markovian reformulation of the non-Markovian SA, and (2) a combination of almost-sure time-varying bounds, conditioning arguments, and Markov chain concentration inequalities to break the strong correlations between the adaptive stepsizes and the iterates. The tools developed in this work are likely to be broadly applicable to the analysis of general SA algorithms with adaptive stepsizes.
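To make the ingredients named above concrete, the following is a minimal Python sketch of asynchronous average-reward $Q$-learning in which each state-action pair keeps its own visit counter as a local clock. It is an illustration under stated assumptions, not the algorithm as specified in the paper: the stepsize form `alpha / (count + h)`, the uniform behavior policy, the mean-subtraction used as the centering step, and all function and parameter names (`span_seminorm`, `random_mdp`, `avg_reward_q_learning`, `alpha`, `h`) are hypothetical choices made for this example.

```python
import numpy as np


def span_seminorm(x):
    """Span seminorm: largest entry minus smallest entry.

    It is zero for constant arrays, so it ignores additive shifts.
    """
    return float(np.max(x) - np.min(x))


def random_mdp(num_states, num_actions, seed=0):
    """Build a small random MDP (hypothetical helper for this sketch)."""
    rng = np.random.default_rng(seed)
    # P[s, a] is a probability vector over next states.
    P = rng.dirichlet(np.ones(num_states), size=(num_states, num_actions))
    R = rng.random((num_states, num_actions))
    return P, R


def avg_reward_q_learning(P, R, num_steps, alpha=1.0, h=1.0,
                          center=False, seed=0):
    """Asynchronous average-reward Q-learning along a single trajectory.

    Assumptions of this sketch (not taken from the paper):
      * stepsize alpha / (N(s, a) + h), where N(s, a) counts visits to (s, a);
      * a uniform random behavior policy;
      * centering implemented by subtracting the mean of Q after each update.
    """
    rng = np.random.default_rng(seed)
    num_states, num_actions = R.shape
    Q = np.zeros((num_states, num_actions))
    counts = np.zeros((num_states, num_actions), dtype=int)
    s = 0
    for _ in range(num_steps):
        a = rng.integers(num_actions)
        s_next = rng.choice(num_states, p=P[s, a])
        # Only the visited pair advances its local clock.
        counts[s, a] += 1
        step = alpha / (counts[s, a] + h)
        # Undiscounted temporal-difference error.
        td_error = R[s, a] + Q[s_next].max() - Q[s, a]
        Q[s, a] += step * td_error
        if center:
            Q -= Q.mean()
        s = s_next
    return Q, counts
```

Calling `avg_reward_q_learning(*random_mdp(3, 2), num_steps=1000, center=True)` returns the final iterate together with the per-pair visit counts. Because the stepsize depends on `counts`, each update depends on the whole visit history rather than only on the current sample, which is the non-Markovian structure the abstract refers to.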
