We study the convergence behavior of the celebrated temporal-difference (TD) learning algorithm. By looking at the algorithm through the lens of optimization, we first argue that TD can be viewed as an iterative optimization algorithm in which the function to be minimized changes at every iteration. By carefully investigating the divergence displayed by TD on a classical counterexample, we identify two forces that determine whether the algorithm converges or diverges. We next formalize this discovery in the linear TD setting with quadratic loss and prove that convergence of TD hinges on the interplay between these two forces. We then extend this optimization perspective to prove convergence of TD in a much broader setting than linear approximation and squared loss. Our results provide a theoretical explanation for the successful application of TD in reinforcement learning.
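To make the optimization view concrete, the following is a minimal sketch, not the construction used in the paper. The abstract does not say which counterexample is studied, so the sketch assumes the well-known two-state example in which a single transition leads from a state with feature 1 to a state with feature 2, with reward 0. The names `td_step` and `run_td`, the step size, and the discount values are all hypothetical choices made for illustration. Each step freezes the bootstrapped target at the current parameter and takes one gradient step on the resulting squared loss, so the function being minimized changes at every iteration.

```python
def td_step(theta, alpha, gamma, phi_s=1.0, phi_next=2.0, reward=0.0):
    """One linear TD(0) step on a single transition s -> s'.

    Assumed setup (illustrative only): scalar parameter theta,
    value estimates phi_s * theta and phi_next * theta.
    """
    # The target is computed from the current theta and then held fixed,
    # so the loss below is a different function at every iteration.
    target = reward + gamma * phi_next * theta
    # Gradient of 0.5 * (target - phi_s * w)**2 with respect to w,
    # evaluated at w = theta.
    grad = -(target - phi_s * theta) * phi_s
    return theta - alpha * grad


def run_td(theta0, alpha, gamma, steps):
    """Repeat the TD step and return the full parameter trajectory."""
    theta = theta0
    history = [theta]
    for _ in range(steps):
        theta = td_step(theta, alpha, gamma)
        history.append(theta)
    return history


if __name__ == "__main__":
    for gamma in (0.4, 0.9):
        final = run_td(theta0=1.0, alpha=0.1, gamma=gamma, steps=100)[-1]
        print(f"gamma={gamma}: |theta| after 100 steps = {abs(final):.3e}")
```

Under these assumptions each step multiplies the parameter by 1 + alpha * (2 * gamma - 1), so the iterates shrink toward zero when gamma is below 0.5 and grow without bound when gamma is above 0.5. This is only one simple way to see a moving target compete with the gradient step toward it; the two forces identified in the paper are defined there and may be formalized differently.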