We resolve the open question regarding the sample complexity of policy learning for maximizing the long-run average reward of a uniformly ergodic Markov decision process (MDP), assuming access to a generative model. In this setting, the existing literature provides a sample complexity upper bound of $\widetilde O(|S||A|t_{\text{mix}}^2 \epsilon^{-2})$ and a lower bound of $\Omega(|S||A|t_{\text{mix}} \epsilon^{-2})$. In these expressions, $|S|$ and $|A|$ denote the cardinalities of the state and action spaces respectively, $t_{\text{mix}}$ is a uniform upper bound on the total variation mixing times, and $\epsilon$ denotes the error tolerance. A gap of a factor of $t_{\text{mix}}$ therefore remains between the two bounds. Our primary contribution is an estimator for the optimal policy of average reward MDPs with a sample complexity of $\widetilde O(|S||A|t_{\text{mix}}\epsilon^{-2})$. This is the first algorithm and analysis to match the lower bound in the literature. Our new algorithm draws inspiration from ideas in Li et al. (2020), Jin and Sidford (2021), and Wang et al. (2023). We also conduct numerical experiments to validate our theoretical findings.
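To make the size of the gap concrete, the ratio of the prior upper bound to the lower bound can be written out directly. This is a rough illustration only: it ignores the constants and logarithmic factors hidden by the $\widetilde O$ and $\Omega$ notation.

$$
\frac{|S||A|\,t_{\text{mix}}^{2}\,\epsilon^{-2}}{|S||A|\,t_{\text{mix}}\,\epsilon^{-2}} = t_{\text{mix}}.
$$

As a hypothetical example (these values are assumptions chosen for illustration and are not taken from the experiments), let $|S| = 100$, $|A| = 10$, $t_{\text{mix}} = 50$, and $\epsilon = 0.1$. The prior upper bound then scales as $100 \cdot 10 \cdot 50^{2} \cdot 0.1^{-2} = 2.5 \times 10^{8}$ samples, whereas the rate $|S||A|t_{\text{mix}}\epsilon^{-2}$ scales as $100 \cdot 10 \cdot 50 \cdot 0.1^{-2} = 5 \times 10^{6}$, a factor of $t_{\text{mix}} = 50$ smaller.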