Gradient-based methods for value estimation in reinforcement learning have favorable stability properties, but they are typically much slower than Temporal Difference (TD) learning methods. We study the root causes of this slowness and show that the Mean Square Bellman Error (MSBE) is an ill-conditioned loss function, in the sense that its Hessian has a large condition number. To resolve the adverse effect of the poor conditioning of MSBE on gradient-based methods, we propose a low-complexity, batch-free proximal method that approximately follows the Gauss-Newton direction and is asymptotically robust to parameterization. Our main algorithm, called RANS, is efficient in the sense that it is significantly faster than residual gradient methods while having almost the same computational complexity, and it is competitive with TD on the classic problems that we tested.
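As a rough illustration of the ill-conditioning claim, the following sketch computes the condition number of the MSBE Hessian on a toy problem. It is not taken from the paper's experiments: the ring-shaped random walk, the tabular (one parameter per state) value function, the uniform state weighting, and the helper names `ring_random_walk` and `msbe_hessian` are all assumptions made for this example. Under these assumptions the loss is ||r + gamma * P v - v||^2, so its Hessian is 2 A^T A with A = I - gamma * P.

```python
import numpy as np


def ring_random_walk(n):
    """Transition matrix of a symmetric random walk on a ring of n states.

    Hypothetical toy environment, chosen only because its spectrum is
    known in closed form.
    """
    P = np.zeros((n, n))
    for s in range(n):
        P[s, (s - 1) % n] = 0.5  # step left with probability 1/2
        P[s, (s + 1) % n] = 0.5  # step right with probability 1/2
    return P


def msbe_hessian(P, gamma):
    """Hessian of the tabular, uniformly weighted loss ||r + gamma*P v - v||^2.

    The loss is quadratic in v, so the Hessian is constant: 2 * A^T A
    with A = I - gamma * P. It does not depend on the rewards r.
    """
    A = np.eye(P.shape[0]) - gamma * P
    return 2.0 * A.T @ A


if __name__ == "__main__":
    P = ring_random_walk(10)
    for gamma in (0.9, 0.99, 0.999):
        kappa = np.linalg.cond(msbe_hessian(P, gamma))
        print(f"gamma={gamma}: condition number ~ {kappa:.1f}")
```

For this particular chain with an even number of states, the eigenvalues of P range from -1 to 1, so the condition number equals ((1 + gamma) / (1 - gamma))^2 and grows quadratically in the effective horizon 1 / (1 - gamma). Plain gradient descent on such a loss needs a number of iterations that scales with this condition number, which is consistent with the slowness the abstract attributes to residual gradient methods.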