We consider the problem of designing sample-efficient learning algorithms for infinite-horizon discounted-reward Markov Decision Processes. Specifically, we propose the Accelerated Natural Policy Gradient (ANPG) algorithm, which uses an accelerated stochastic gradient descent procedure to obtain the natural policy gradient. Under general parameterization, ANPG achieves $\mathcal{O}(\epsilon^{-2})$ sample complexity and $\mathcal{O}(\epsilon^{-1})$ iteration complexity, where $\epsilon$ denotes the optimality error. This improves the state-of-the-art sample complexity by a $\log(\frac{1}{\epsilon})$ factor. ANPG is a first-order algorithm and, unlike some existing approaches, does not require the unverifiable assumption that the variance of the importance sampling (IS) weights is upper bounded. Within the class of Hessian-free and IS-free algorithms, ANPG improves on the best-known sample complexity by a factor of $\mathcal{O}(\epsilon^{-\frac{1}{2}})$ while matching the state-of-the-art iteration complexity.
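To illustrate the idea of obtaining a natural policy gradient direction through accelerated stochastic gradient descent, the following is a minimal sketch and not the ANPG algorithm itself. It assumes the common formulation in which the natural policy gradient direction is the minimizer of a least-squares objective that regresses advantage estimates onto score vectors. Plain Nesterov momentum stands in for whatever accelerated scheme the paper actually uses, and all names, step sizes, and the synthetic data (`features` as stand-in score vectors, `targets` as stand-in advantage estimates) are hypothetical.

```python
import random


def accelerated_sgd_least_squares(features, targets, steps=2000, lr=0.05,
                                  momentum=0.9, batch_size=32, seed=0):
    """Minimize 0.5 * E[(phi^T w - a)^2] with Nesterov-momentum SGD.

    Illustrative stand-in only: `features` play the role of score vectors
    and `targets` the role of advantage estimates (both hypothetical).
    """
    rng = random.Random(seed)
    n, d = len(features), len(features[0])
    w = [0.0] * d
    w_prev = [0.0] * d
    for _step in range(steps):
        # Look-ahead point used by Nesterov momentum.
        look = [w[j] + momentum * (w[j] - w_prev[j]) for j in range(d)]
        # Mini-batch stochastic gradient evaluated at the look-ahead point.
        grad = [0.0] * d
        for _sample in range(batch_size):
            i = rng.randrange(n)
            phi = features[i]
            residual = sum(p * x for p, x in zip(phi, look)) - targets[i]
            for j in range(d):
                grad[j] += residual * phi[j] / batch_size
        w_prev, w = w, [look[j] - lr * grad[j] for j in range(d)]
    return w


# Synthetic regression problem with a known solution (assumption for the demo).
data_rng = random.Random(1)
d = 3
w_true = [data_rng.gauss(0, 1) for _ in range(d)]
features = [[data_rng.gauss(0, 1) for _ in range(d)] for _ in range(2000)]
targets = [sum(p * x for p, x in zip(phi, w_true)) + 0.01 * data_rng.gauss(0, 1)
           for phi in features]

w_hat = accelerated_sgd_least_squares(features, targets)
error = sum((a - b) ** 2 for a, b in zip(w_hat, w_true)) ** 0.5
```

In a policy gradient loop, a direction like `w_hat` would then be used to update the policy parameters. That outer loop, the sampling procedure, and the step-size choices behind the stated complexity bounds are not reproduced here.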