Gradient-based methods for value estimation in reinforcement learning have favorable stability properties, but they are typically much slower than Temporal Difference (TD) learning methods. We study the root causes of this slowness and show that the Mean Square Bellman Error (MSBE) is an ill-conditioned loss function in the sense that its Hessian has a large condition number. To resolve the adverse effect of the poor conditioning of MSBE on gradient-based methods, we propose a low-complexity, batch-free proximal method that approximately follows the Gauss-Newton direction and is asymptotically robust to parameterization. Our main algorithm, called RANS, is efficient in the sense that it is significantly faster than residual gradient methods while having almost the same computational complexity, and it is competitive with TD on the classic problems that we tested.
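To make the conditioning claim concrete, the following is a minimal sketch rather than the paper's experiment. It assumes a tabular value function, a toy symmetric random walk with reflecting ends, and uniform state weighting; the helper names `msbe_hessian` and `random_walk` are hypothetical. Under these assumptions the MSBE is the quadratic L(v) = Σ_s d_s ((I − γP)v − r)_s², so its Hessian is 2(I − γP)ᵀ D (I − γP), and its condition number can be computed directly.

```python
import numpy as np


def msbe_hessian(P, gamma, d):
    """Hessian of the tabular MSBE  L(v) = sum_s d[s] * ((v - r - gamma * P @ v)[s])**2.

    The loss is quadratic in v, so the Hessian does not depend on v or r.
    """
    A = np.eye(P.shape[0]) - gamma * P
    return 2.0 * A.T @ np.diag(d) @ A


def random_walk(n):
    """Transition matrix of a symmetric random walk on n states with
    reflecting ends (an assumed toy problem, not taken from the paper)."""
    P = np.zeros((n, n))
    for s in range(n):
        P[s, max(s - 1, 0)] += 0.5
        P[s, min(s + 1, n - 1)] += 0.5
    return P


if __name__ == "__main__":
    n = 50
    P = random_walk(n)
    d = np.full(n, 1.0 / n)  # uniform state weighting (assumption)
    for gamma in (0.5, 0.9, 0.99):
        H = msbe_hessian(P, gamma, d)
        print(f"gamma={gamma}: cond(H) = {np.linalg.cond(H):.1f}")
```

In this toy setting the condition number grows rapidly as the discount factor approaches 1, which is the kind of ill-conditioning that slows plain gradient descent on the MSBE. Other feature maps or state weightings would give different numbers.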