We consider the problem of designing sample-efficient learning algorithms for infinite-horizon discounted-reward Markov Decision Processes. Specifically, we propose the Accelerated Natural Policy Gradient (ANPG) algorithm, which utilizes an accelerated stochastic gradient descent procedure to obtain the natural policy gradient. ANPG achieves $\mathcal{O}(\epsilon^{-2})$ sample complexity and $\mathcal{O}(\epsilon^{-1})$ iteration complexity with general parameterization, where $\epsilon$ denotes the optimality error. This improves the state-of-the-art sample complexity by a $\log(\frac{1}{\epsilon})$ factor. ANPG is a first-order algorithm and, unlike some existing works, does not require the unverifiable assumption that the variance of the importance sampling (IS) weights is upper bounded. In the class of Hessian-free and IS-free algorithms, ANPG beats the best-known sample complexity by a factor of $\mathcal{O}(\epsilon^{-\frac{1}{2}})$ while simultaneously matching their state-of-the-art iteration complexity.
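To make the mechanism concrete: under general parameterization, the natural policy gradient direction can be written as the solution of a least-squares regression of advantage estimates onto the score vectors $\nabla_\theta \log \pi_\theta(a|s)$ (the compatible function approximation view), and the abstract's key idea is to solve this inner regression with an *accelerated* stochastic gradient method. The sketch below is a minimal illustration of that inner routine, not the authors' implementation; the function name `npg_direction_asgd`, the inputs `feats`/`adv`, and the specific Nesterov-momentum update and hyperparameters are all illustrative assumptions.

```python
import numpy as np

def npg_direction_asgd(feats, adv, lr=0.01, momentum=0.9, epochs=100, seed=0):
    """Estimate the NPG direction w* by solving the regression
        min_w  E[(w^T grad_log_pi(s,a) - A_pi(s,a))^2]
    with a Nesterov-accelerated SGD loop (one sample per update).

    feats: (n, d) array of score vectors grad_log_pi(s_i, a_i)  [assumed given]
    adv:   (n,)   array of advantage estimates A_pi(s_i, a_i)   [assumed given]
    """
    rng = np.random.default_rng(seed)
    n, d = feats.shape
    w = np.zeros(d)       # current iterate
    w_prev = np.zeros(d)  # previous iterate, used for the momentum term
    for _ in range(epochs):
        for i in rng.permutation(n):
            # Nesterov look-ahead point
            y = w + momentum * (w - w_prev)
            # stochastic gradient of 0.5 * (y^T x_i - A_i)^2 at the look-ahead
            g = (y @ feats[i] - adv[i]) * feats[i]
            w_prev, w = w, y - lr * g
    return w

# Toy usage: recover a planted direction from noisy advantage targets.
rng = np.random.default_rng(1)
X = rng.normal(size=(512, 8))
w_true = rng.normal(size=8)
A = X @ w_true + 0.01 * rng.normal(size=512)
w_hat = npg_direction_asgd(X, A)
```

The returned `w_hat` would then serve as the update direction for the policy parameters, playing the role that the (Fisher-preconditioned) natural gradient plays in exact NPG.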