Many reinforcement learning algorithms rely on value estimation; however, the most widely used algorithms -- namely temporal difference algorithms -- can diverge under both off-policy sampling and nonlinear function approximation. Many algorithms have been developed for off-policy value estimation based on the linear mean squared projected Bellman error (MSPBE), and these are sound under linear function approximation. Extending these methods to the nonlinear case has been largely unsuccessful. Recently, several methods have been introduced that approximate a different objective -- the mean squared Bellman error (MSBE) -- which naturally facilitates nonlinear function approximation. In this work, we build on these insights and introduce a new generalized MSPBE that extends the linear MSPBE to the nonlinear setting. We show how this generalized objective unifies previous work and obtain new bounds on the value error of its solutions. We derive an easy-to-use but sound algorithm to minimize the generalized objective, and show that it is more stable across runs, is less sensitive to hyperparameters, and performs favorably across four control domains with neural network function approximation.
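For context, a minimal sketch of the linear MSPBE referenced above, in the conventional notation of the gradient-TD literature; the symbols below ($x$ for features, $\rho$ for the importance-sampling ratio, $d_\mu$ for the behavior distribution, and the quantities $A$, $b$, $C$) are standard choices from that literature, not notation taken from this work:
% The linear MSPBE: the D-weighted distance between the value estimate
% v_w(s) = w^T x(s) and the projection Pi_D of its Bellman image back
% onto the span of the features, with D = diag(d_mu).
\[
\overline{\mathrm{MSPBE}}(w)
  \;=\; \bigl\lVert \Pi_D\, T^{\pi} v_w - v_w \bigr\rVert_D^{2}
  \;=\; (b - A w)^{\top} C^{-1} (b - A w),
\]
% with expectations taken under the behavior distribution d_mu and
% importance ratio rho = pi(A|S) / mu(A|S):
\[
A \doteq \mathbb{E}\bigl[\rho\, x (x - \gamma x')^{\top}\bigr],
\qquad
b \doteq \mathbb{E}[\rho\, r\, x],
\qquad
C \doteq \mathbb{E}\bigl[x x^{\top}\bigr].
\]
The MSBE simply drops the projection, $\overline{\mathrm{MSBE}}(w) = \lVert T^{\pi} v_w - v_w \rVert_D^{2}$, which is why it remains well defined for nonlinear $v_w$: no feature subspace is needed to project onto.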