We study the sample complexity of learning an $\epsilon$-optimal policy in an average-reward Markov decision process (MDP) under a generative model. For weakly communicating MDPs, we establish the complexity bound $\tilde{O}\left(SA\frac{H}{\epsilon^2}\right)$, where $H$ is the span of the bias function of the optimal policy and $SA$ is the cardinality of the state-action space. Our result is the first that is minimax optimal (up to log factors) in all parameters $S,A,H$ and $\epsilon$, improving on existing work that either assumes uniformly bounded mixing times for all policies or has suboptimal dependence on the parameters. We further investigate sample complexity in general (non-weakly-communicating) average-reward MDPs. We argue that a new transient time parameter $B$ is necessary, establish an $\tilde{O}\left(SA\frac{B+H}{\epsilon^2}\right)$ complexity bound, and prove a matching (up to log factors) minimax lower bound. Both results are based on reducing the average-reward MDP to a discounted MDP, which requires new ideas in the general setting. To establish the optimality of this reduction, we develop improved bounds for $\gamma$-discounted MDPs, showing that $\tilde{O}\left(SA\frac{H}{(1-\gamma)^2\epsilon^2}\right)$ samples suffice to learn an $\epsilon$-optimal policy in weakly communicating MDPs in the regime $\gamma\geq 1-\frac{1}{H}$, and that $\tilde{O}\left(SA\frac{B+H}{(1-\gamma)^2\epsilon^2}\right)$ samples suffice in general MDPs when $\gamma\geq 1-\frac{1}{B+H}$. Both of these results circumvent the well-known lower bound of $\tilde{\Omega}\left(SA\frac{1}{(1-\gamma)^3\epsilon^2}\right)$ for arbitrary $\gamma$-discounted MDPs. Our analysis develops upper bounds on certain instance-dependent variance parameters in terms of the span and transient time parameters. The weakly communicating bounds are tighter than those based on the mixing time or diameter of the MDP and may be of broader use.
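The arithmetic behind the reduction can be checked with a minimal sketch in Python. It assumes, purely for illustration, that the average-reward problem with target accuracy $\epsilon$ is mapped to a discounted problem with $1-\gamma = \epsilon/H$ and discounted target accuracy $H$; this parameter choice, the function names, and the numeric values are assumptions not stated in the abstract, and all constants and log factors are dropped.

```python
import math


def span_discounted_rate(S, A, span, gamma, eps_gamma):
    """Span-based discounted rate S*A*span / ((1-gamma)^2 * eps_gamma^2).

    Constants and log factors are dropped; meaningful only for gamma >= 1 - 1/span.
    """
    return S * A * span / ((1.0 - gamma) ** 2 * eps_gamma ** 2)


def generic_discounted_rate(S, A, gamma, eps_gamma):
    """Generic discounted rate S*A / ((1-gamma)^3 * eps_gamma^2), constants dropped."""
    return S * A / ((1.0 - gamma) ** 3 * eps_gamma ** 2)


def average_reward_rate(S, A, span, eps):
    """Average-reward rate S*A*span / eps^2, constants and log factors dropped."""
    return S * A * span / eps ** 2


def reduction_parameters(span, eps):
    """Assumed illustrative choice: 1 - gamma = eps / span, discounted accuracy = span."""
    gamma = 1.0 - eps / span
    eps_gamma = span
    return gamma, eps_gamma


# Hypothetical instance sizes, chosen only to exercise the formulas.
S, A, H, eps = 10, 5, 20.0, 0.1

gamma, eps_gamma = reduction_parameters(H, eps)
via_reduction = span_discounted_rate(S, A, H, gamma, eps_gamma)
direct = average_reward_rate(S, A, H, eps)
generic = generic_discounted_rate(S, A, gamma, eps_gamma)

print("gamma in the span regime:", gamma >= 1.0 - 1.0 / H)
print("rates agree:", math.isclose(via_reduction, direct, rel_tol=1e-9))
print("generic / span-based ratio:", generic / via_reduction)
```

Under these assumptions, plugging the reduction parameters into the span-based discounted rate recovers $SA\frac{H}{\epsilon^2}$ exactly, whereas the generic $(1-\gamma)^{-3}$ rate at the same parameters is larger by a factor of $\frac{1}{(1-\gamma)H} = \frac{1}{\epsilon}$. This is why a discounted bound that improves on the generic rate in the regime $\gamma\geq 1-\frac{1}{H}$ is needed for the reduction to be optimal.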