Q-learning, which seeks to learn the optimal Q-function of a Markov decision process (MDP) in a model-free fashion, lies at the heart of reinforcement learning. In the synchronous setting (where independent samples for all state-action pairs are drawn from a generative model in each iteration), substantial progress has been made towards understanding the sample efficiency of Q-learning. Consider a $\gamma$-discounted infinite-horizon MDP with state space $\mathcal{S}$ and action space $\mathcal{A}$: to yield an entrywise $\varepsilon$-approximation of the optimal Q-function, state-of-the-art theory for Q-learning requires a sample size exceeding the order of $\frac{|\mathcal{S}||\mathcal{A}|}{(1-\gamma)^5\varepsilon^{2}}$, which fails to match existing minimax lower bounds. This gives rise to natural questions: what is the sharp sample complexity of Q-learning? Is Q-learning provably sub-optimal? This paper addresses these questions for the synchronous setting: (1) when $|\mathcal{A}|=1$ (so that Q-learning reduces to TD learning), we prove that the sample complexity of TD learning is minimax optimal and scales as $\frac{|\mathcal{S}|}{(1-\gamma)^3\varepsilon^2}$ (up to a logarithmic factor); (2) when $|\mathcal{A}|\geq 2$, we settle the sample complexity of Q-learning to be on the order of $\frac{|\mathcal{S}||\mathcal{A}|}{(1-\gamma)^4\varepsilon^2}$ (up to a logarithmic factor). Our theory unveils the strict sub-optimality of Q-learning when $|\mathcal{A}|\geq 2$, and makes rigorous the negative impact of over-estimation in Q-learning. Finally, we extend our analysis to accommodate asynchronous Q-learning (i.e., the case with Markovian samples), sharpening the horizon dependency of its sample complexity to $\frac{1}{(1-\gamma)^4}$.
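To make the synchronous setting concrete, the following is a minimal sketch of synchronous Q-learning with a generative model on a small random MDP. It is illustrative only and is not taken from the paper: the helper names (`random_mdp`, `optimal_q`, `sync_q_learning`), the random MDP construction, and the step size $\eta_t = 1/(1+(1-\gamma)t)$ are all assumptions made for this example.

```python
import numpy as np


def random_mdp(S, A, rng):
    """Hypothetical test MDP: random transition kernel P[s, a, s'] and rewards in [0, 1]."""
    P = rng.dirichlet(np.ones(S), size=(S, A))  # shape (S, A, S), each row sums to 1
    r = rng.random((S, A))
    return P, r


def optimal_q(P, r, gamma, iters=2000):
    """Value iteration on the known model; used only as ground truth for comparison."""
    Q = np.zeros_like(r)
    for _ in range(iters):
        Q = r + gamma * P @ Q.max(axis=1)
    return Q


def sync_q_learning(P, r, gamma, T, rng):
    """Synchronous Q-learning: one fresh sample per (s, a) pair in every iteration."""
    S, A = r.shape
    Q = np.zeros((S, A))
    cdf = P.cumsum(axis=2)
    for t in range(1, T + 1):
        # Rescaled linear step size (an assumption of this sketch).
        eta = 1.0 / (1.0 + (1.0 - gamma) * t)
        # Generative model: draw s' ~ P(. | s, a) for all pairs via inverse CDF.
        u = rng.random((S, A, 1))
        next_s = np.minimum((u > cdf).sum(axis=2), S - 1)
        # Empirical Bellman optimality update using the sampled next states.
        V = Q.max(axis=1)
        Q = (1.0 - eta) * Q + eta * (r + gamma * V[next_s])
    return Q


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    gamma = 0.8
    P, r = random_mdp(5, 3, rng)
    Q_star = optimal_q(P, r, gamma)
    Q_hat = sync_q_learning(P, r, gamma, T=20000, rng=rng)
    print("entrywise error:", np.max(np.abs(Q_hat - Q_star)))
```

Each iteration consumes $|\mathcal{S}||\mathcal{A}|$ samples, so running `T` iterations uses `T * S * A` samples in total; the sample-complexity bounds above concern how large this total must be to reach entrywise error $\varepsilon$.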