We consider the optimal sample complexity theory of tabular reinforcement learning (RL) for maximizing the infinite-horizon discounted reward in a Markov decision process (MDP). Optimal worst-case complexity results have been developed for tabular RL problems in this setting, yielding a sample complexity dependence on $\gamma$ and $\epsilon$ of the form $\tilde \Theta((1-\gamma)^{-3}\epsilon^{-2})$, where $\gamma$ denotes the discount factor and $\epsilon$ is the solution error tolerance. However, in many applications of interest, the optimal policy (or every policy) induces mixing. We establish that in such settings, the optimal sample complexity dependence is $\tilde \Theta(t_{\text{mix}}(1-\gamma)^{-2}\epsilon^{-2})$, where $t_{\text{mix}}$ is the total variation mixing time. Our analysis is grounded in regeneration-type ideas, which we believe are of independent interest because they can be used to study RL problems for general state space MDPs.
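As a rough illustration of the gap between the two rates stated above, one can take their ratio while ignoring constants and the logarithmic factors hidden by the $\tilde \Theta$ notation. The numerical values $\gamma = 0.99$ and $t_{\text{mix}} = 10$ below are hypothetical, chosen only for this illustration:

$$
\frac{(1-\gamma)^{-3}\epsilon^{-2}}{t_{\text{mix}}(1-\gamma)^{-2}\epsilon^{-2}} = \frac{1}{t_{\text{mix}}(1-\gamma)}, \qquad \text{e.g.}\quad \frac{1}{10 \cdot (1 - 0.99)} = 10 .
$$

Under this comparison, the mixing-based rate is the smaller of the two whenever $t_{\text{mix}} < (1-\gamma)^{-1}$, and the dependence on $\epsilon$ is unchanged.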