Kakade's natural policy gradient method has been studied extensively in recent years, showing linear convergence with and without regularization. We study another natural gradient method, based on the Fisher information matrix of the state-action distributions, which has received little attention from the theoretical side. Here, the state-action distributions follow the Fisher-Rao gradient flow inside the state-action polytope with respect to a linear potential. We therefore study Fisher-Rao gradient flows of linear programs more generally and show linear convergence with a rate that depends on the geometry of the linear program. Equivalently, this yields an estimate of the error induced by entropic regularization of the linear program, which improves existing results. We extend these results and show sublinear convergence for perturbed Fisher-Rao gradient flows and natural gradient flows, up to an approximation error. In particular, these general results cover the case of state-action natural policy gradients.
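To fix ideas, here is a minimal sketch of the object under study, restricted to the probability simplex $\Delta$; the general state-action polytope adds linear constraints, and the symbols $c$, $\mu_t$, $i^\star$, and $H$ below are illustrative assumptions rather than notation from the paper. For the linear program $\max_{\mu \in \Delta} \langle c, \mu \rangle$, the Fisher-Rao gradient flow of the linear potential reduces to the replicator equation
\[
\dot{\mu}_t(i) \;=\; \mu_t(i)\,\bigl(c_i - \langle c, \mu_t \rangle\bigr),
\]
whose solution for uniform initialization is $\mu_t(i) \propto e^{t c_i}$. If the maximizer $i^\star$ is unique, the suboptimality $\langle c, \mu^\star \rangle - \langle c, \mu_t \rangle$ decays at the exponential rate $\Delta_c = c_{i^\star} - \max_{i \neq i^\star} c_i$, a simple instance of a convergence rate governed by the geometry of the linear program. Moreover, the same trajectory solves the entropically regularized problem $\max_{\mu \in \Delta} \langle c, \mu \rangle + t^{-1} H(\mu)$ with Shannon entropy $H$, which illustrates the sense in which convergence of the flow and error estimates for entropic regularization are equivalent.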