This paper proposes a novel termination criterion, termed the advantage gap function, for finite state and action Markov decision processes (MDPs) and reinforcement learning (RL). By incorporating this advantage gap function into the design of step size rules and deriving a new linear rate of convergence that is independent of the stationary state distribution of the optimal policy, we demonstrate that policy gradient methods can solve MDPs in strongly polynomial time. To the best of our knowledge, this is the first time that such strong convergence properties have been established for policy gradient methods. Moreover, in the stochastic setting, where only stochastic estimates of policy gradients are available, we show that the advantage gap function provides close approximations of the optimality gap at each individual state and exhibits a sublinear rate of convergence at every state. The advantage gap function can be easily estimated in the stochastic case, and when coupled with easily computable upper bounds on policy values, it provides a convenient way to validate the solutions generated by policy gradient methods. Therefore, our developments offer a principled and computable measure of optimality for RL, whereas current practice tends to rely on algorithm-to-algorithm or baseline comparisons with no certificate of optimality.
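For concreteness, the following sketch illustrates how an advantage-based gap can certify per-state optimality; it is based on the standard performance difference lemma in the reward-maximization convention, and the paper's precise definition of the advantage gap function may differ. With $A^{\pi}(s,a) := Q^{\pi}(s,a) - V^{\pi}(s)$, one can consider
\[
  \mathrm{gap}(\pi) \;:=\; \max_{s \in \mathcal{S}} \, \max_{a \in \mathcal{A}} \, A^{\pi}(s,a) \;\ge\; 0 ,
\]
and, letting $d^{\pi^{*}}_{s}$ denote the normalized discounted state-visitation distribution of an optimal policy $\pi^{*}$ started at $s$, the performance difference lemma gives
\[
  0 \;\le\; V^{*}(s) - V^{\pi}(s)
    \;=\; \frac{1}{1-\gamma}\,
          \mathbb{E}_{s' \sim d^{\pi^{*}}_{s}}\!\Big[A^{\pi}\big(s', \pi^{*}(s')\big)\Big]
    \;\le\; \frac{\mathrm{gap}(\pi)}{1-\gamma}
  \qquad \text{for every state } s .
\]
In this illustrative form, $\mathrm{gap}(\pi) = 0$ exactly when $\pi$ is greedy with respect to its own action-value function, i.e., when $\pi$ is optimal, so a small computable gap bounds the optimality gap at every state without any reference to the optimal policy's stationary distribution.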