We settle the sample complexity of policy learning for maximizing the long-run average reward of a uniformly ergodic Markov decision process (MDP), assuming access to a generative model. In this setting, the existing literature provides a sample complexity upper bound of $\widetilde O(|S||A|t_{\text{mix}}^2 \epsilon^{-2})$ and a lower bound of $\Omega(|S||A|t_{\text{mix}} \epsilon^{-2})$. Here, $|S|$ and $|A|$ denote the cardinalities of the state and action spaces, respectively, $t_{\text{mix}}$ is a uniform upper bound on the total variation mixing times, and $\epsilon$ is the error tolerance. A gap of a factor of $t_{\text{mix}}$ therefore remains between the two bounds. Our primary contribution is an estimator for the optimal policy of average-reward MDPs with a sample complexity of $\widetilde O(|S||A|t_{\text{mix}}\epsilon^{-2})$, which matches the lower bound in the literature up to logarithmic factors. This is achieved by combining algorithmic ideas from Jin and Sidford (2021) with those of Li et al. (2020).
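To make the size of the gap concrete, the ratio of the existing upper bound to the lower bound can be written out, ignoring the constants and logarithmic factors hidden by the $\widetilde O$ and $\Omega$ notation (a simplification made here for illustration only):

$$
\frac{|S||A|\,t_{\text{mix}}^{2}\,\epsilon^{-2}}{|S||A|\,t_{\text{mix}}\,\epsilon^{-2}} = t_{\text{mix}}.
$$

The dependence on $|S|$, $|A|$, and $\epsilon$ already agrees in the two bounds, so only the power of $t_{\text{mix}}$ differs.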