Policy Mirror Descent (PMD) is a versatile algorithmic framework encompassing several seminal policy gradient algorithms, such as the natural policy gradient, with connections to state-of-the-art reinforcement learning (RL) algorithms such as TRPO and PPO. PMD can be seen as a soft Policy Iteration algorithm implementing regularized 1-step greedy policy improvement. However, 1-step greedy policies might not be the best choice, and recent remarkable empirical successes in RL, such as AlphaGo and AlphaZero, have demonstrated that greedy approaches with respect to multiple steps outperform their 1-step counterparts. In this work, we propose a new class of PMD algorithms, called $h$-PMD, which incorporates multi-step greedy policy improvement with lookahead depth $h$ into the PMD update rule. To solve discounted infinite-horizon Markov Decision Processes with discount factor $\gamma$, we show that $h$-PMD, which generalizes standard PMD, enjoys a faster dimension-free $\gamma^h$-linear convergence rate, contingent on the computation of multi-step greedy policies. We propose an inexact version of $h$-PMD in which the lookahead action values are estimated. Under a generative model, we establish a sample complexity for $h$-PMD that improves over prior work. Finally, we extend our results to linear function approximation to scale to large state spaces. Under suitable assumptions, our sample complexity depends only on the dimension of the feature map space rather than on the state space size.
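To make the algorithmic idea concrete, here is a minimal tabular sketch of one $h$-PMD iteration: the $h$-step lookahead action values are obtained by applying a Bellman optimality backup $h-1$ times to the current value estimate, and the policy is then updated with a KL-regularized (multiplicative-weights) mirror descent step. All function names, array shapes, and the choice of the KL mirror map are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def h_pmd_step(P, r, V, pi, gamma, h, eta):
    """One sketched h-PMD iteration on a tabular MDP.

    P:  transition tensor, shape (S, A, S)   (assumed layout)
    r:  reward matrix, shape (S, A)
    V:  current state-value estimate, shape (S,)
    pi: current policy, shape (S, A), rows sum to 1
    gamma: discount factor, h: lookahead depth, eta: step size
    """
    # h-step lookahead: apply the Bellman optimality operator h-1 times to V,
    # i.e. W <- max_a [ r(s,a) + gamma * sum_s' P(s'|s,a) W(s') ]
    W = V.copy()
    for _ in range(h - 1):
        W = np.max(r + gamma * P @ W, axis=1)
    # lookahead action values Q_h(s,a) = r(s,a) + gamma * E[W(s')]
    Q_h = r + gamma * P @ W
    # KL-regularized mirror descent step: pi_{k+1} ∝ pi_k * exp(eta * Q_h)
    logits = np.log(pi + 1e-12) + eta * Q_h
    new_pi = np.exp(logits - logits.max(axis=1, keepdims=True))
    new_pi /= new_pi.sum(axis=1, keepdims=True)
    return new_pi, Q_h
```

With $h=1$ the loop is skipped and the update reduces to the standard (soft Policy Iteration style) PMD step, which matches the abstract's claim that $h$-PMD generalizes PMD.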