In this work, we show that natural policy gradient, a core algorithm in reinforcement learning, admits an exact formulation as a smoothed and averaged form of policy iteration. Specifically, we introduce doubly smoothed policy iteration (DSPI), a Bellman-operator framework in which each policy is obtained by applying a regularized greedy step to a weighted average of past $Q$-functions. DSPI includes policy iteration, dual-averaged policy iteration, natural policy gradient, and more general policy dual averaging methods as special cases. Using only monotonicity and contraction of smoothed Bellman operators, we prove distribution-free global geometric convergence of DSPI. Consequently, standard natural policy gradient and policy dual averaging achieve an iteration complexity of $\mathcal{O}((1-\gamma)^{-1}\log((1-\gamma)^{-1}\epsilon^{-1}))$ for computing an $\epsilon$-optimal policy, without modifying the MDP, adding regularization beyond the mirror map inherent in the update, or using adaptive, trajectory-dependent stepsizes. For the unregularized greedy case, corresponding to dual-averaged policy iteration, we also prove finite termination. The same Bellman-operator framework further extends to discounted MDPs with linear function approximation and stochastic shortest path problems.
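To make the "regularized greedy step on averaged $Q$-functions" idea concrete, the following is a minimal tabular sketch, not the authors' algorithm as specified in the paper. It assumes a uniform average of past $Q$-functions, a softmax (entropy-regularized) greedy step, exact policy evaluation, and a constant stepsize `eta`; with these choices the update coincides with the standard softmax natural policy gradient recursion started from the uniform policy. The function names, the weighting, and the two-state MDP are all hypothetical choices made for illustration.

```python
import numpy as np

def q_values(P, R, pi, gamma):
    """Exact Q^pi for a tabular MDP. Shapes: P (S, A, S), R (S, A), pi (S, A)."""
    S, _ = R.shape
    P_pi = np.einsum("sa,sat->st", pi, P)   # state-to-state transitions under pi
    r_pi = (pi * R).sum(axis=1)             # expected one-step reward under pi
    V = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)
    return R + gamma * (P @ V)              # Q(s, a) = R(s, a) + gamma * E[V(s')]

def softmax_greedy(scores):
    """Softmax over actions: the entropy-regularized counterpart of argmax."""
    z = scores - scores.max(axis=1, keepdims=True)  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def averaged_policy_iteration(P, R, gamma, eta=1.0, iters=200):
    """Assumed sketch: softmax step on the running sum of past Q-functions.

    eta * q_sum equals (eta * t) times the uniform average of Q^{pi_0..pi_{t-1}},
    so the greedy step sharpens as more Q-functions are accumulated.
    """
    S, A = R.shape
    pi = np.full((S, A), 1.0 / A)           # start from the uniform policy
    q_sum = np.zeros((S, A))
    for _ in range(iters):
        q_sum += q_values(P, R, pi, gamma)  # accumulate Q of the current policy
        pi = softmax_greedy(eta * q_sum)    # regularized greedy step
    return pi, q_values(P, R, pi, gamma)

def two_state_example():
    """Hypothetical 2-state, 2-action MDP used only for illustration."""
    P = np.zeros((2, 2, 2))
    P[0, 0, 0] = 1.0   # state 0, action 0: stay in state 0
    P[0, 1, 1] = 1.0   # state 0, action 1: move to state 1
    P[1, 0, 1] = 1.0   # state 1, action 0: stay in state 1
    P[1, 1, 0] = 1.0   # state 1, action 1: move to state 0
    R = np.array([[0.0, 0.0],
                  [1.0, 0.0]])  # only staying in state 1 is rewarded
    return P, R
```

Calling `averaged_policy_iteration(*two_state_example(), gamma=0.9)` returns the final policy and its $Q$-function. Replacing `softmax_greedy` with a hard argmax would give an unregularized variant in the spirit of the dual-averaged policy iteration mentioned above, though the exact weighting used in the paper is not stated in this abstract.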