We consider the problem of control in reinforcement learning (RL), where model information is not available. Policy gradient algorithms are a popular approach to this problem and are usually shown to converge to a stationary point of the value function. In this paper, we propose two policy Newton algorithms that incorporate cubic regularization. Both algorithms employ the likelihood ratio method to form estimates of the gradient and Hessian of the value function from sample trajectories. The first algorithm requires an exact solution of the cubic-regularized problem in each iteration, while the second employs an efficient gradient-descent-based approximation to the cubic-regularized problem. We establish convergence of the proposed algorithms to a second-order stationary point (SOSP) of the value function, which means they avoid traps in the form of saddle points. In particular, the sample complexity of our algorithms to find an $\epsilon$-SOSP is $O(\epsilon^{-3.5})$, an improvement over the state-of-the-art sample complexity of $O(\epsilon^{-4.5})$.
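To illustrate the gradient-descent-based approximation mentioned in the abstract, the following is a minimal sketch, not the authors' algorithm. It assumes a minimization convention (for instance, applied to the negative of the value function) and the standard cubic-regularized model $m(s) = g^\top s + \tfrac{1}{2} s^\top H s + \tfrac{\alpha}{6}\|s\|^3$. The function names, the regularization constant `alpha`, the step size, and the iteration count are all illustrative assumptions. The gradient `g` and Hessian `H` are supplied exactly here, whereas the paper estimates them from sample trajectories.

```python
import math


def cubic_model_value(g, H, s, alpha):
    """Value of m(s) = g.s + 0.5 * s.H.s + (alpha / 6) * ||s||^3."""
    n = len(s)
    norm_s = math.sqrt(sum(si * si for si in s))
    linear = sum(g[i] * s[i] for i in range(n))
    quadratic = 0.5 * sum(s[i] * H[i][j] * s[j] for i in range(n) for j in range(n))
    return linear + quadratic + (alpha / 6.0) * norm_s ** 3


def cubic_model_grad(g, H, s, alpha):
    """Gradient of the cubic model: g + H s + (alpha / 2) * ||s|| * s."""
    n = len(s)
    norm_s = math.sqrt(sum(si * si for si in s))
    Hs = [sum(H[i][j] * s[j] for j in range(n)) for i in range(n)]
    return [g[i] + Hs[i] + 0.5 * alpha * norm_s * s[i] for i in range(n)]


def solve_cubic_subproblem(g, H, alpha, step_size=0.05, num_iters=500):
    """Approximately minimize the cubic model by plain gradient descent.

    step_size and num_iters are hypothetical defaults chosen for the toy
    example below; they are not taken from the paper.
    """
    s = [0.0] * len(g)
    for _ in range(num_iters):
        grad = cubic_model_grad(g, H, s, alpha)
        s = [si - step_size * gi for si, gi in zip(s, grad)]
    return s


# Toy objective f(x, y) = x^2 - y^2, which has a saddle point at the origin.
# Evaluate the exact gradient and Hessian at a point just off the saddle.
theta = [0.0, 0.01]
g = [2.0 * theta[0], -2.0 * theta[1]]
H = [[2.0, 0.0], [0.0, -2.0]]

step = solve_cubic_subproblem(g, H, alpha=1.0)
print("step:", step)
print("model decrease:", -cubic_model_value(g, H, step, alpha=1.0))
```

In this toy case the gradient is nearly zero, so a first-order step would barely move, while the cubic-regularized step follows the negative-curvature direction (the second coordinate) away from the saddle. This is the mechanism behind convergence to a second-order stationary point.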