Policy Mirror Descent (PMD) is a versatile algorithmic framework that encompasses several seminal policy gradient algorithms, such as natural policy gradient, and is connected to state-of-the-art reinforcement learning (RL) algorithms such as TRPO and PPO. PMD can be seen as a soft Policy Iteration algorithm that implements regularized 1-step greedy policy improvement. However, 1-step greedy policies might not be the best choice: recent empirical successes in RL, such as AlphaGo and AlphaZero, have demonstrated that policies that are greedy with respect to multiple steps outperform their 1-step counterparts. In this work, we propose a new class of PMD algorithms called $h$-PMD, which incorporates multi-step greedy policy improvement with lookahead depth $h$ into the PMD update rule. For discounted infinite-horizon Markov Decision Processes with discount factor $\gamma$, we show that $h$-PMD, which generalizes standard PMD, enjoys a faster dimension-free $\gamma^h$-linear convergence rate, contingent on the computation of multi-step greedy policies. We also propose an inexact version of $h$-PMD in which lookahead action values are estimated. Under a generative model, we establish a sample complexity for $h$-PMD that improves over prior work. Finally, we extend our results to linear function approximation to scale to large state spaces. Under suitable assumptions, our sample complexity depends only on the dimension of the feature map space rather than on the size of the state space.
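To make the idea of multi-step lookahead inside a PMD update concrete, the following is a minimal tabular sketch and not the authors' implementation. It assumes an exact model (transition tensor `P` of shape states × actions × states and reward matrix `r`), a KL (softmax) mirror map, and a constant step size `eta`; the function names `policy_value`, `lookahead_q`, and `h_pmd_step` are hypothetical. The lookahead action values are assumed to be obtained by applying the Bellman optimality operator $h-1$ times to the current policy's value function and then performing a one-step backup, so that $h=1$ reduces to the usual action values of standard PMD.

```python
import numpy as np


def policy_value(P, r, pi, gamma):
    """Exact V^pi, obtained by solving (I - gamma * P_pi) V = r_pi."""
    n_states = r.shape[0]
    P_pi = np.einsum("sa,sat->st", pi, P)  # state-to-state transitions under pi
    r_pi = np.sum(pi * r, axis=1)          # expected one-step reward under pi
    return np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)


def lookahead_q(P, r, V, gamma, h):
    """Assumed h-step lookahead action values built from a value function V.

    Applies the Bellman optimality operator h - 1 times to V, then does a
    one-step backup. For h = 1 this is the ordinary one-step backup of V.
    """
    for _ in range(h - 1):
        V = np.max(r + gamma * P @ V, axis=1)
    return r + gamma * P @ V


def h_pmd_step(P, r, pi, gamma, h, eta):
    """One KL-regularized (multiplicative-weights) update using lookahead values."""
    V = policy_value(P, r, pi, gamma)
    Q = lookahead_q(P, r, V, gamma, h)
    logits = np.log(pi) + eta * Q
    logits -= logits.max(axis=1, keepdims=True)  # shift for numerical stability
    new_pi = np.exp(logits)
    return new_pi / new_pi.sum(axis=1, keepdims=True)


# Small random MDP used purely for illustration (all values are made up).
rng = np.random.default_rng(0)
n_states, n_actions, gamma = 4, 3, 0.9
P = rng.random((n_states, n_actions, n_states))
P /= P.sum(axis=2, keepdims=True)  # normalize into valid transition kernels
r = rng.random((n_states, n_actions))

pi = np.full((n_states, n_actions), 1.0 / n_actions)  # uniform initial policy
for _ in range(20):
    pi = h_pmd_step(P, r, pi, gamma, h=3, eta=1.0)
```

The sketch covers only the exact, tabular setting. The inexact variant mentioned above would replace `lookahead_q` with an estimate of the lookahead values, for example from samples drawn under a generative model.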