Sutton, Szepesvári and Maei introduced the first gradient temporal-difference (GTD) learning algorithms compatible with both linear function approximation and off-policy training. This paper (a) proposes several GTD variants, supported by extensive comparative analysis, and (b) establishes new theoretical analysis frameworks for GTD algorithms. The variants are based on convex-concave saddle-point interpretations of GTD, which unify all GTD algorithms in a single framework and yield a simple stability analysis based on recent results on primal-dual gradient dynamics. Finally, numerical comparisons evaluate these approaches.
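For orientation, the saddle-point view can be illustrated with the commonly cited formulation for GTD2. This is a sketch under stated assumptions, not necessarily the exact form used in the paper: the symbols $\phi$, $\phi'$ (current and next-state features), $r$ (reward), $\gamma$ (discount factor), $\rho$ (importance-sampling ratio), $\theta$ (primal weights) and $w$ (dual weights) are introduced here for illustration only.

```latex
% Assumed notation (not taken from the abstract):
%   A = E[ rho * phi (phi - gamma phi')^T ],  b = E[ rho * r * phi ],  C = E[ phi phi^T ]
\begin{align*}
  % Convex-concave saddle-point problem: convex (linear) in theta, concave in w
  \min_{\theta}\,\max_{w}\; L(\theta, w)
    &= (b - A\theta)^{\top} w - \tfrac{1}{2}\, w^{\top} C\, w, \\
  % Maximizing over w gives w^* = C^{-1}(b - A\theta), which recovers the
  % mean-square projected Bellman error objective
  \max_{w} L(\theta, w)
    &= \tfrac{1}{2}\,(b - A\theta)^{\top} C^{-1} (b - A\theta), \\
  % Primal-dual gradient dynamics: descent in theta, ascent in w
  \dot{\theta} &= -\nabla_{\theta} L(\theta, w) = A^{\top} w, \\
  \dot{w} &= \nabla_{w} L(\theta, w) = b - A\theta - C w .
\end{align*}
```

Under these assumptions, stability of the continuous-time dynamics follows the general pattern for primal-dual gradient flows on convex-concave functions, which is the kind of result the abstract refers to.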