Gradient-based methods for value estimation in reinforcement learning have favorable stability properties, but they are typically much slower than Temporal Difference (TD) learning methods. We study the root causes of this slowness and show that the Mean Square Bellman Error (MSBE) is an ill-conditioned loss function, in the sense that its Hessian has a large condition number. To resolve the adverse effect of the poor conditioning of MSBE on gradient-based methods, we propose a low-complexity, batch-free proximal method that approximately follows the Gauss-Newton direction and is asymptotically robust to parameterization. Our main algorithm, called RANS, is efficient in the sense that it is significantly faster than residual gradient methods while having almost the same computational complexity, and it is competitive with TD on the classic problems that we tested.
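To make the conditioning claim concrete, the following is a minimal sketch, not the RANS algorithm, that computes the condition number of the MSBE Hessian in a small toy problem. Everything in it is an illustrative assumption rather than something taken from the paper: a 20-state symmetric random walk with a known transition matrix `P`, a tabular value function, uniform state weights, and the helper names `random_walk_transitions` and `msbe_hessian`. Under these assumptions the MSBE is quadratic in the value vector, so its Hessian is the constant matrix 2 AᵀDA with A = I − γP.

```python
import numpy as np


def random_walk_transitions(n_states):
    """Symmetric random walk on a chain (hypothetical toy problem).

    Each state moves left or right with probability 0.5; a move that would
    leave the chain keeps the agent in the boundary state instead.
    """
    P = np.zeros((n_states, n_states))
    for s in range(n_states):
        left = max(s - 1, 0)
        right = min(s + 1, n_states - 1)
        P[s, left] += 0.5
        P[s, right] += 0.5
    return P


def msbe_hessian(P, gamma, state_weights):
    """Hessian of MSBE(v) = sum_s d(s) * ((v - gamma * P v - r)(s))**2.

    Assumes a tabular value function v and a known model P. The reward r
    only shifts the residual, so it does not appear in the Hessian.
    """
    A = np.eye(P.shape[0]) - gamma * P
    D = np.diag(state_weights)
    return 2.0 * A.T @ D @ A


n_states = 20
P = random_walk_transitions(n_states)
uniform = np.full(n_states, 1.0 / n_states)  # assumed uniform state weighting

gammas = [0.5, 0.9, 0.99]
hessian_conds = [np.linalg.cond(msbe_hessian(P, g, uniform)) for g in gammas]
for g, c in zip(gammas, hessian_conds):
    print(f"gamma={g}: cond(MSBE Hessian) = {c:.1f}")
```

In this toy setting the condition number grows rapidly as the discount factor approaches 1, because the smallest eigenvalue of I − γP is 1 − γ and the Hessian squares the condition number of that matrix. Plain gradient descent on a quadratic slows down in proportion to this number, which is the kind of slowness that a Gauss-Newton-style direction is meant to counteract.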