We develop several provably efficient model-free reinforcement learning (RL) algorithms for infinite-horizon average-reward Markov Decision Processes (MDPs). We consider both the online setting and the setting with access to a simulator. In the online setting, we propose a model-free RL algorithm based on reference-advantage decomposition. The algorithm achieves $\widetilde{O}(S^5A^2\mathrm{sp}(h^*)\sqrt{T})$ regret after $T$ steps, where $S\times A$ is the size of the state-action space and $\mathrm{sp}(h^*)$ is the span of the optimal bias function. Our results are the first to achieve optimal dependence on $T$ for weakly communicating MDPs. In the simulator setting, we propose a model-free RL algorithm that finds an $\epsilon$-optimal policy using $\widetilde{O} \left(\frac{SA\mathrm{sp}^2(h^*)}{\epsilon^2}+\frac{S^2A\mathrm{sp}(h^*)}{\epsilon} \right)$ samples, whereas the minimax lower bound is $\Omega\left(\frac{SA\mathrm{sp}(h^*)}{\epsilon^2}\right)$. Our results are based on two new techniques that are unique to the average-reward setting: 1) better discounted approximation by value-difference estimation; 2) efficient construction of a confidence region for the optimal bias function with space complexity $O(SA)$.
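To make the quantities in these bounds concrete, the following is a minimal Python sketch, not part of the proposed algorithms. It assumes the standard definition of the span, $\mathrm{sp}(h) = \max_s h(s) - \min_s h(s)$, and evaluates the stated rates with all constants and logarithmic factors dropped, so the outputs indicate scaling only. The function names and the example values are hypothetical and chosen purely for illustration.

```python
from typing import Sequence


def span(h: Sequence[float]) -> float:
    """Span of a bias vector: max_s h(s) - min_s h(s) (standard definition, assumed here)."""
    return max(h) - min(h)


def online_regret_rate(S: int, A: int, sp: float, T: int) -> float:
    """S^5 * A^2 * sp(h*) * sqrt(T), with constants and log factors dropped."""
    return S**5 * A**2 * sp * T**0.5


def simulator_upper_rate(S: int, A: int, sp: float, eps: float) -> float:
    """S*A*sp^2/eps^2 + S^2*A*sp/eps, with constants and log factors dropped."""
    return S * A * sp**2 / eps**2 + S**2 * A * sp / eps


def simulator_lower_rate(S: int, A: int, sp: float, eps: float) -> float:
    """S*A*sp/eps^2, the minimax lower-bound rate with constants dropped."""
    return S * A * sp / eps**2


if __name__ == "__main__":
    # Hypothetical bias vector and problem sizes, for illustration only.
    h_star = [0.0, 1.5, -0.5]
    sp = span(h_star)
    S, A, eps, T = 10, 5, 0.5, 10_000

    upper = simulator_upper_rate(S, A, sp, eps)
    lower = simulator_lower_rate(S, A, sp, eps)
    print("span:", sp)
    print("online regret rate:", online_regret_rate(S, A, sp, T))
    print("simulator upper rate:", upper)
    print("simulator lower rate:", lower)
    print("upper / lower:", upper / lower)
```

Comparing the two simulator rates shows where the gap lies: the leading $1/\epsilon^2$ term of the upper bound exceeds the lower bound by a factor of $\mathrm{sp}(h^*)$, and the additional $S^2A\mathrm{sp}(h^*)/\epsilon$ term is lower order in $\epsilon$.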