Policy gradient methods have recently been shown to enjoy global convergence at a $\Theta(1/t)$ rate in the non-regularized tabular softmax setting. An important research question is therefore whether this convergence rate can be improved further using only first-order updates. In this paper, we answer this question from the perspective of momentum by adapting the celebrated Nesterov's accelerated gradient (NAG) method to reinforcement learning (RL), yielding a method we term \textit{Accelerated Policy Gradient} (APG). To demonstrate the potential of APG for faster global convergence, we formally show that, with the true gradient, APG with softmax policy parametrization converges to an optimal policy at a $\tilde{O}(1/t^2)$ rate. To the best of our knowledge, this is the first characterization of the global convergence rate of NAG in the context of RL. Notably, our analysis relies on one interesting finding: regardless of the initialization, APG reaches a locally nearly-concave regime within a finite number of iterations, and in this regime it benefits significantly from the momentum. Through numerical validation, we confirm that APG exhibits a $\tilde{O}(1/t^2)$ rate and show that APG can significantly improve the convergence behavior over standard policy gradient.
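To make the update concrete, the following is a minimal sketch of a Nesterov-style momentum update applied to a softmax policy with the true gradient. It is an illustration under stated assumptions, not the paper's exact algorithm: the environment is a hypothetical three-armed bandit (a single-state MDP), the momentum coefficient $(t-1)/(t+2)$ is the standard NAG schedule, and the reward vector `r`, step size `eta`, and all function names are chosen here for illustration only.

```python
import numpy as np

def softmax(theta):
    # Numerically stable softmax over action preferences.
    z = theta - theta.max()
    e = np.exp(z)
    return e / e.sum()

def value(theta, r):
    # Expected reward of the softmax policy on a bandit with rewards r.
    return softmax(theta) @ r

def grad(theta, r):
    # True gradient of the expected reward w.r.t. theta:
    # dJ/dtheta_a = pi_a * (r_a - pi . r)
    pi = softmax(theta)
    return pi * (r - pi @ r)

def policy_gradient(r, eta, steps):
    # Plain gradient ascent on theta; records the sub-optimality gap per step.
    theta = np.zeros_like(r)
    gaps = []
    for _ in range(steps):
        theta = theta + eta * grad(theta, r)
        gaps.append(r.max() - value(theta, r))
    return np.array(gaps)

def nesterov_policy_gradient(r, eta, steps):
    # NAG-style ascent (assumed standard schedule): take the gradient step
    # from the look-ahead point omega, then extrapolate with momentum.
    theta = np.zeros_like(r)
    omega = theta.copy()
    gaps = []
    for t in range(1, steps + 1):
        theta_next = omega + eta * grad(omega, r)
        omega = theta_next + (t - 1) / (t + 2) * (theta_next - theta)
        theta = theta_next
        gaps.append(r.max() - value(theta, r))
    return np.array(gaps)

# Hypothetical toy instance: three arms, uniform initial policy.
r = np.array([1.0, 0.8, 0.2])
pg = policy_gradient(r, eta=0.4, steps=2000)
apg = nesterov_policy_gradient(r, eta=0.4, steps=2000)
print("final gap, policy gradient:", pg[-1])
print("final gap, Nesterov variant:", apg[-1])
```

Plotting `pg` and `apg` on a log-log scale gives a rough visual comparison of the decay of the sub-optimality gap under the two updates. A toy run of this kind is only a sanity check and does not by itself verify the $\Theta(1/t)$ or $\tilde{O}(1/t^2)$ rates.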