Various acceleration approaches for Policy Gradient (PG) have been analyzed within the realm of Reinforcement Learning (RL). However, the theoretical understanding of the widely used momentum-based acceleration method for PG remains largely open. In response to this gap, we adapt the celebrated Nesterov's accelerated gradient (NAG) method to policy optimization in RL, termed \textit{Accelerated Policy Gradient} (APG). To demonstrate the potential of APG in achieving fast convergence, we formally prove that, with the true gradient and under the softmax policy parametrization, APG converges to an optimal policy at the following rates: (i) $\tilde{O}(1/t^2)$ with constant step sizes; (ii) $O(e^{-ct})$ with exponentially-growing step sizes. To the best of our knowledge, this is the first characterization of the convergence rates of NAG in the context of RL. Notably, our analysis relies on an interesting finding: regardless of the parameter initialization, APG enters a locally nearly-concave regime within finitely many iterations, where it can significantly benefit from momentum. Through numerical validation and experiments on the Atari 2600 benchmarks, we confirm that APG exhibits an $\tilde{O}(1/t^2)$ rate with constant step sizes and a linear convergence rate with exponentially-growing step sizes, significantly improving convergence over the standard PG.
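The core idea can be illustrated on a toy problem. The sketch below applies a Nesterov-style look-ahead update with the exact policy gradient to a 3-armed bandit under the softmax parametrization; the reward vector, step size, and momentum schedule are illustrative choices, not the paper's exact algorithm or hyperparameters.

```python
import numpy as np

# Illustrative setup (assumed, not from the paper): a 3-armed bandit
# where the objective is J(theta) = pi(theta) . r.
r = np.array([1.0, 0.5, 0.2])   # per-action rewards; arm 0 is optimal

def softmax(theta):
    z = np.exp(theta - theta.max())
    return z / z.sum()

def grad_J(theta):
    # Exact (true) policy gradient under the softmax parametrization:
    # dJ/dtheta_a = pi_a * (r_a - pi . r)
    pi = softmax(theta)
    return pi * (r - pi @ r)

eta = 0.4                        # illustrative constant step size
theta = np.zeros(3)
theta_prev = theta.copy()
for t in range(1, 2001):
    # Nesterov look-ahead point, then a gradient *ascent* step from it
    phi = theta + (t - 1) / (t + 2) * (theta - theta_prev)
    theta_prev = theta
    theta = phi + eta * grad_J(phi)

pi = softmax(theta)
print(pi[0])  # probability of the optimal arm approaches 1
```

The `(t - 1) / (t + 2)` momentum weight is the standard NAG schedule for constant step sizes; the exponentially-growing step-size variant analyzed in the paper would replace the fixed `eta` with a schedule that grows over iterations.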