Policy gradient algorithms have been widely applied to Markov decision processes and reinforcement learning problems in recent years. Regularization with various entropy functions is often used to encourage exploration and improve stability. This paper proposes an approximate Newton method for the policy gradient algorithm with entropy regularization. In the case of Shannon entropy, the resulting algorithm reproduces the natural policy gradient algorithm. For other entropy functions, the method yields new policy gradient algorithms. We prove that all these algorithms enjoy Newton-type quadratic convergence and that the corresponding gradient flow converges globally to the optimal solution. Using synthetic and industrial-scale examples, we demonstrate that the proposed approximate Newton method typically converges in fewer than ten iterations, often orders of magnitude faster than other state-of-the-art algorithms.
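To make the Shannon-entropy case concrete, the following is a minimal sketch of a commonly written tabular form of the entropy-regularized natural policy gradient update, in which the new policy is the softmax of the regularized action values divided by the temperature (soft policy iteration). It is not the paper's approximate Newton method: the abstract gives neither the update rule nor the step size, so the random MDP, the reward-maximization convention, the particular step size implied by this update, and every name below (`soft_policy_evaluation`, `npg_step`, `run`, `tau`) are assumptions made purely for illustration.

```python
import numpy as np


def soft_policy_evaluation(P, r, pi, gamma, tau):
    """Exactly evaluate policy pi under Shannon-entropy regularization.

    P: (S, A, S) transition probabilities, r: (S, A) rewards,
    pi: (S, A) strictly positive policy, tau: regularization strength.
    """
    num_states = r.shape[0]
    # Expected one-step reward plus tau times the entropy of pi(.|s).
    r_pi = np.sum(pi * (r - tau * np.log(pi)), axis=1)
    # State-to-state transition matrix induced by pi.
    P_pi = np.einsum("sa,sat->st", pi, P)
    # Solve the linear Bellman equation (I - gamma * P_pi) V = r_pi.
    V = np.linalg.solve(np.eye(num_states) - gamma * P_pi, r_pi)
    # Regularized action values Q(s, a) = r(s, a) + gamma * E[V(s')].
    Q = r + gamma * (P @ V)
    return V, Q


def npg_step(Q, tau):
    """Assumed update: new policy is the row-wise softmax of Q / tau."""
    z = Q / tau
    z -= z.max(axis=1, keepdims=True)  # shift for numerical stability
    pi = np.exp(z)
    return pi / pi.sum(axis=1, keepdims=True)


def run(num_states=5, num_actions=3, gamma=0.9, tau=0.5, iters=300, seed=0):
    """Iterate evaluation and improvement on a random (hypothetical) MDP."""
    rng = np.random.default_rng(seed)
    P = rng.random((num_states, num_actions, num_states))
    P /= P.sum(axis=2, keepdims=True)
    r = rng.random((num_states, num_actions))
    pi = np.full((num_states, num_actions), 1.0 / num_actions)
    values = []
    for _ in range(iters):
        V, Q = soft_policy_evaluation(P, r, pi, gamma, tau)
        values.append(V)
        pi = npg_step(Q, tau)
    return np.array(values), pi
```

Calling `values, pi = run()` returns the regularized value of each iterate, so successive rows of `values` can be compared to watch the iteration settle. This baseline converges linearly; the quadratic convergence claimed in the abstract applies to the paper's approximate Newton updates, which are not reproduced here.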