Various acceleration approaches for Policy Gradient (PG) have been analyzed within the realm of Reinforcement Learning (RL). However, the theoretical understanding of the widely used momentum-based acceleration method for PG remains largely open. In response to this gap, we adapt the celebrated Nesterov's accelerated gradient (NAG) method to policy optimization in RL, termed \textit{Accelerated Policy Gradient} (APG). To demonstrate the potential of APG in achieving fast convergence, we formally prove that, with the true gradient and under the softmax policy parametrization, APG converges to an optimal policy at the following rates: (i) $\tilde{O}(1/t^2)$ with constant step sizes; (ii) $O(e^{-ct})$ with exponentially-growing step sizes. To the best of our knowledge, this is the first characterization of the convergence rates of NAG in the context of RL. Notably, our analysis relies on an interesting finding: regardless of the parameter initialization, APG enters a locally nearly-concave regime within finitely many iterations, where it can significantly benefit from momentum. Through numerical validation and experiments on the Atari 2600 benchmarks, we confirm that APG exhibits an $\tilde{O}(1/t^2)$ rate with constant step sizes and a linear convergence rate with exponentially-growing step sizes, significantly improving convergence over the standard PG.
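The core idea can be illustrated on a toy problem. The sketch below applies a Nesterov-style look-ahead update with the exact policy gradient to a 3-armed bandit under the softmax parametrization; the reward vector, step size, and momentum schedule are illustrative choices, not the paper's exact algorithm or hyperparameters.

```python
import numpy as np

# Illustrative setup (assumed, not from the paper): a 3-armed bandit
# where the objective is J(theta) = pi(theta) . r.
r = np.array([1.0, 0.5, 0.2])   # per-action rewards; arm 0 is optimal

def softmax(theta):
    z = np.exp(theta - theta.max())
    return z / z.sum()

def grad_J(theta):
    # Exact (true) policy gradient under the softmax parametrization:
    # dJ/dtheta_a = pi_a * (r_a - pi . r)
    pi = softmax(theta)
    return pi * (r - pi @ r)

eta = 0.4                        # illustrative constant step size
theta = np.zeros(3)
theta_prev = theta.copy()
for t in range(1, 2001):
    # Nesterov look-ahead point, then a gradient *ascent* step from it
    phi = theta + (t - 1) / (t + 2) * (theta - theta_prev)
    theta_prev = theta
    theta = phi + eta * grad_J(phi)

pi = softmax(theta)
print(pi[0])  # probability of the optimal arm approaches 1
```

The `(t - 1) / (t + 2)` momentum weight is the standard NAG schedule for constant step sizes; the exponentially-growing step-size variant analyzed in the paper would replace the fixed `eta` with a schedule that grows over iterations.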