This paper proposes a novel termination criterion, termed the advantage gap function, for finite state and action Markov decision processes (MDPs) and reinforcement learning (RL). By incorporating this advantage gap function into the design of step size rules and deriving a new linear rate of convergence that is independent of the stationary state distribution of the optimal policy, we demonstrate that policy gradient methods can solve MDPs in strongly polynomial time. To the best of our knowledge, this is the first time that such strong convergence properties have been established for policy gradient methods. Moreover, in the stochastic setting, where only stochastic estimates of policy gradients are available, we show that the advantage gap function provides close approximations of the optimality gap for each individual state and exhibits a sublinear rate of convergence at every state. The advantage gap function is easy to estimate in the stochastic case and, when coupled with easily computable upper bounds on policy values, provides a convenient way to validate the solutions generated by policy gradient methods. Therefore, our developments offer a principled and computable measure of optimality for RL, whereas current practice tends to rely on algorithm-to-algorithm or baseline comparisons with no certificate of optimality.
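For concreteness, the following is a minimal illustrative sketch of how such a termination criterion could be formalized, assuming a discounted reward-maximization MDP and the standard advantage function; the quantity $g^{\pi}$, the discount factor $\gamma$, and the bound below are illustrative assumptions and need not coincide with the paper's exact definitions.
\[
A^{\pi}(s,a) \;=\; Q^{\pi}(s,a) - V^{\pi}(s),
\qquad
g^{\pi}(s) \;:=\; \max_{a \in \mathcal{A}} A^{\pi}(s,a) \;\ge\; 0 .
\]
Here $g^{\pi}(s) = 0$ for every state $s$ if and only if $\pi$ is optimal, and the performance difference lemma gives, for each individual state $s$,
\[
V^{*}(s) - V^{\pi}(s) \;\le\; \frac{1}{1-\gamma}\,\max_{s'} g^{\pi}(s') ,
\]
so terminating once $\max_{s} g^{\pi}(s) \le (1-\gamma)\varepsilon$ certifies $\varepsilon$-optimality at every state without any knowledge of $\pi^{*}$ or its stationary state distribution.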