We study the convergence of off-policy TD(0) with linear function approximation when used to approximate the expected discounted reward in a Markov chain. It is well known that the combination of off-policy learning and function approximation can lead to divergence of the algorithm. Existing results for this setting modify the algorithm, for instance by reweighting the updates using importance sampling. This establishes convergence at the expense of additional complexity. In contrast, our approach is to analyse the standard algorithm, but to restrict our attention to the class of reversible Markov chains. We demonstrate convergence under this mild reversibility condition on the structure of the chain, which in many applications can be assumed using domain knowledge. In particular, we establish a convergence guarantee under an upper bound on the discount factor, expressed in terms of the difference between the on-policy and off-policy process. By providing an explicit bound, this improves upon known results in the literature, which state only that convergence holds for a sufficiently small discount factor. Convergence holds with probability one, and the limit point attains zero projected Bellman error. To obtain these results, we adapt the stochastic approximation framework that was used by Tsitsiklis and Van Roy [1997] for the on-policy case to the off-policy case. We illustrate our results using different types of reversible Markov chains, such as one-dimensional random walks and random walks on a weighted graph.
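As an illustration of the algorithm under study, the sketch below implements one common formulation of off-policy TD(0) with linear function approximation on a reversible chain, namely a random walk on a weighted graph: states are drawn from a behaviour distribution that differs from the target chain's stationary distribution, transitions are drawn from the target chain, and the standard update is applied without any importance-sampling correction. This is a minimal sketch under those assumptions, not the paper's exact setup; the feature matrix, rewards, behaviour distribution, and step-size schedule (phi, r, d_behaviour, alpha) are hypothetical choices made for illustration only.

    import numpy as np

    rng = np.random.default_rng(0)

    # Random walk on a weighted graph: P[i, j] is proportional to the symmetric
    # edge weight W[i, j]. Such a chain is reversible with respect to its
    # stationary distribution.
    W = np.array([[0.0, 2.0, 1.0],
                  [2.0, 0.0, 3.0],
                  [1.0, 3.0, 0.0]])          # symmetric edge weights (hypothetical)
    P = W / W.sum(axis=1, keepdims=True)     # reversible target chain

    n_states = P.shape[0]
    phi = np.array([[1.0, 0.0],
                    [0.5, 0.5],
                    [0.0, 1.0]])             # linear features, one row per state
    r = np.array([1.0, 0.0, -1.0])           # expected one-step rewards
    gamma = 0.3                              # discount factor, kept small (cf. the bound)

    d_behaviour = np.array([0.6, 0.2, 0.2])  # off-policy state distribution

    theta = np.zeros(phi.shape[1])
    for t in range(200_000):
        s = rng.choice(n_states, p=d_behaviour)   # state from the behaviour distribution
        s_next = rng.choice(n_states, p=P[s])     # transition from the target chain
        # Standard TD(0) update: no importance weights, linear value estimate phi @ theta.
        td_error = r[s] + gamma * phi[s_next] @ theta - phi[s] @ theta
        alpha = 1.0 / (1 + t) ** 0.7              # diminishing step sizes
        theta += alpha * td_error * phi[s]

    print("learned weights:", theta)

With the reversible chain and the small discount factor used here, the iterates settle to a fixed weight vector; the paper's contribution is the explicit condition on gamma, in terms of the gap between the on-policy and off-policy process, under which this behaviour is guaranteed.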