Policy gradient (PG) is widely used in reinforcement learning due to its scalability and good performance. In recent years, several variance-reduced PG methods have been proposed with theoretical guarantees of convergence to an approximate first-order stationary point (FOSP) with a sample complexity of $O(\epsilon^{-3})$. However, FOSPs may be poor local optima or saddle points. Moreover, these algorithms often rely on importance sampling (IS) weights, which can impair the statistical effectiveness of variance reduction. In this paper, we propose a variance-reduced second-order method that uses second-order information in the form of Hessian-vector products (HVPs) and converges to an approximate second-order stationary point (SOSP) with a sample complexity of $\tilde{O}(\epsilon^{-3})$. This rate improves the best-known sample complexity for reaching approximate SOSPs by a factor of $O(\epsilon^{-0.5})$. Moreover, the proposed variance reduction technique bypasses IS weights by using HVP terms. Our experimental results show that the proposed algorithm outperforms the state of the art and is more robust to changes in random seeds.
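To make the role of Hessian-vector products concrete, the following is a minimal sketch and not the algorithm proposed in the paper. It assumes a hypothetical deterministic toy objective with a known gradient (`grad`), approximates the HVP by a central difference of gradients (`hvp`), and checks the generic identity $\nabla f(x_{\text{new}}) - \nabla f(x_{\text{old}}) = \int_0^1 \nabla^2 f\big(x_{\text{old}} + a\,(x_{\text{new}} - x_{\text{old}})\big)(x_{\text{new}} - x_{\text{old}})\,da$. This identity is one way a gradient difference can be expressed through HVP terms, with no importance weights involved. All function names and the toy objective are assumptions made for illustration.

```python
def grad(x):
    # Gradient of the hypothetical toy objective
    # f(x) = sum(x_i^4) / 4 + x_0 * x_1  (requires len(x) >= 2).
    g = [xi ** 3 for xi in x]
    g[0] += x[1]
    g[1] += x[0]
    return g


def hvp(grad_fn, x, v, eps=1e-5):
    # Hessian-vector product H(x) v, approximated by a central difference
    # of gradients; the Hessian matrix itself is never formed.
    x_plus = [xi + eps * vi for xi, vi in zip(x, v)]
    x_minus = [xi - eps * vi for xi, vi in zip(x, v)]
    g_plus, g_minus = grad_fn(x_plus), grad_fn(x_minus)
    return [(a - b) / (2 * eps) for a, b in zip(g_plus, g_minus)]


def grad_difference_via_hvp(grad_fn, x_old, x_new, num_points=200):
    # Approximates grad(x_new) - grad(x_old) by averaging HVPs taken at
    # midpoint-rule nodes on the segment from x_old to x_new.
    d = [b - a for a, b in zip(x_old, x_new)]
    total = [0.0] * len(d)
    for k in range(num_points):
        a = (k + 0.5) / num_points
        x_mid = [xo + a * di for xo, di in zip(x_old, d)]
        h = hvp(grad_fn, x_mid, d)
        total = [t + hi / num_points for t, hi in zip(total, h)]
    return total


if __name__ == "__main__":
    x_old = [1.0, -2.0, 0.5]
    x_new = [1.1, -1.9, 0.4]
    exact = [b - a for a, b in zip(grad(x_old), grad(x_new))]
    approx = grad_difference_via_hvp(grad, x_old, x_new)
    print("exact gradient difference:", exact)
    print("HVP-based approximation:  ", approx)
```

In a stochastic setting, the deterministic average over segment points would typically be replaced by an HVP evaluated at a randomly sampled point on the segment, using freshly sampled data. In practice the finite difference would usually be replaced by automatic differentiation.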