Recently, the impressive empirical success of policy gradient (PG) methods has catalyzed the development of their theoretical foundations. Despite considerable effort devoted to designing efficient stochastic PG-type algorithms, the understanding of their convergence to a globally optimal policy is still limited. In this work, we develop improved global convergence guarantees for a general class of Fisher-non-degenerate parameterized policies, which allows us to address the case of continuous state-action spaces. First, we propose a Normalized Policy Gradient method with Implicit Gradient Transport (N-PG-IGT) and derive a $\tilde{\mathcal{O}}(\varepsilon^{-2.5})$ sample complexity for this method to find a global $\varepsilon$-optimal policy. While improving over the previously known $\tilde{\mathcal{O}}(\varepsilon^{-3})$ complexity, this algorithm does not require importance sampling or second-order information and samples only one trajectory per iteration. Second, we further improve this complexity to $\tilde{\mathcal{O}}(\varepsilon^{-2})$ by considering a Hessian-Aided Recursive Policy Gradient ((N)-HARPG) algorithm enhanced with a correction based on a Hessian-vector product. Interestingly, both algorithms are $(i)$ simple and easy to implement: they are single-loop, do not require large batches of trajectories, and sample at most two trajectories per iteration; and $(ii)$ computationally and memory efficient: they do not require expensive subroutines at each iteration and can be implemented with memory linear in the dimension of the parameters.
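To make the first idea concrete, the following is a minimal sketch of a normalized ascent step combined with an implicit-gradient-transport momentum estimate. The abstract does not state the update rule, so the extrapolation form used here (the commonly used IGT look-ahead point) is an assumption, as are all names and constants (`noisy_grad`, `normalized_igt_ascent`, `gamma`, `eta`). A noisy quadratic objective stands in for the policy-gradient estimate computed from one sampled trajectory; this is not the authors' implementation.

```python
import numpy as np


def noisy_grad(theta, target, rng, noise_std=0.5):
    """Stochastic gradient of the toy objective J(theta) = -0.5 * ||theta - target||^2.

    Assumption: this stands in for a single-trajectory policy gradient estimate.
    """
    return -(theta - target) + noise_std * rng.standard_normal(theta.shape)


def normalized_igt_ascent(theta0, target, steps=2000, gamma=0.01, eta=0.1, seed=0):
    """Normalized gradient ascent with an IGT-style momentum direction (sketch)."""
    rng = np.random.default_rng(seed)
    theta_prev = theta0.copy()
    theta = theta0.copy()
    d = noisy_grad(theta, target, rng)  # initial direction: one stochastic gradient
    step_norms = []
    for _ in range(steps):
        # Normalized step: only the direction of d is used, so each step has length gamma.
        direction = d / (np.linalg.norm(d) + 1e-12)
        theta_next = theta + gamma * direction
        step_norms.append(np.linalg.norm(theta_next - theta))
        theta_prev, theta = theta, theta_next

        # Assumed IGT look-ahead: query the gradient at an extrapolated point
        # so that the running average tracks the gradient at the current iterate.
        theta_tilde = theta + (1.0 - eta) / eta * (theta - theta_prev)
        d = (1.0 - eta) * d + eta * noisy_grad(theta_tilde, target, rng)
    return theta, np.array(step_norms)


target = np.array([1.0, -2.0, 3.0])
theta0 = np.zeros(3)
theta_final, step_norms = normalized_igt_ascent(theta0, target)
print("final distance to optimum:", np.linalg.norm(theta_final - target))
```

In this sketch each iteration draws exactly one stochastic gradient and stores only a few vectors of the same size as `theta`, which mirrors the single-sample, linear-memory properties claimed in the abstract. No importance weights or second-order quantities appear.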