We present convergence rates for synchronous and asynchronous Q-learning in average-reward Markov decision processes, where the absence of contraction poses a fundamental challenge. Existing non-asymptotic results overcome this challenge either by imposing strong assumptions to enforce seminorm contraction or by relying on discounted or episodic Markov decision processes as successive approximations, approaches that either require unknown parameters or result in suboptimal sample complexity. In this work, under a reachability assumption, we establish optimal $\widetilde{O}(\varepsilon^{-2})$ sample complexity guarantees (up to logarithmic factors) for a simple variant of synchronous and asynchronous Q-learning that samples from the lazified dynamics, in which the system remains in the current state with some fixed probability. At the core of our analysis are the construction of an instance-dependent seminorm and a proof that, after a lazy transformation of the Markov decision process, the Bellman operator becomes one-step contractive under this seminorm.
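To make the lazified dynamics concrete, a minimal sketch of the transformation described above: writing $\lambda \in (0,1)$ for the fixed stay-in-place probability and $P_\lambda$ for the lazified kernel (both symbols are our notation, not necessarily the paper's),
$$
P_\lambda(s' \mid s, a) \;=\; \lambda\,\mathbb{1}\{s' = s\} \;+\; (1 - \lambda)\,P(s' \mid s, a),
$$
so each transition keeps the current state with probability $\lambda$ and otherwise follows the original kernel $P$. The Q-learning variant samples its transitions from $P_\lambda$ rather than $P$, and the analysis shows that the associated Bellman operator is a one-step contraction under the instance-dependent seminorm.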