This paper analyzes multi-step TD-learning algorithms in the `deadly triad' scenario, characterized by the combination of linear function approximation, off-policy learning, and bootstrapping. In particular, we prove that n-step TD-learning algorithms converge to a solution when the sampling horizon n is sufficiently large. The paper consists of two parts. In the first part, we comprehensively examine the fundamental properties of their model-based deterministic counterparts, including projected value iteration, gradient descent algorithms, and a control-theoretic approach. These can be viewed as prototype deterministic algorithms whose analysis plays a pivotal role in understanding and developing their model-free reinforcement learning counterparts. We prove that these deterministic algorithms converge to meaningful solutions when n is sufficiently large. Based on these findings, we propose and analyze two n-step TD-learning algorithms, which can be viewed as the model-free reinforcement learning counterparts of the gradient-based and control-theoretic algorithms.
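The model-based deterministic recursions studied in the first part can be illustrated with a minimal sketch. The tiny MDP, the features, and the state-weighting matrix D below are hypothetical placeholders, and the update shown is a generic projected n-step TD recursion under these assumptions, not the paper's exact algorithms; it merely illustrates why a large n (which shrinks the contraction modulus of the n-step Bellman operator to gamma^n) stabilizes the iteration even when D is mismatched with the target policy.

```python
import numpy as np

# Hypothetical 3-state Markov reward process (not from the paper).
P = np.array([[0.5, 0.5, 0.0],
              [0.1, 0.6, 0.3],
              [0.2, 0.2, 0.6]])   # target-policy transition matrix
r = np.array([1.0, 0.0, -1.0])    # expected one-step rewards
gamma = 0.9
Phi = np.array([[1.0, 0.0],
                [0.0, 1.0],
                [1.0, 1.0]])      # linear features, one row per state
D = np.diag([0.2, 0.5, 0.3])     # state weighting, e.g. an off-policy
                                  # behavior distribution (hypothetical)

def T_n(v, n):
    """n-step Bellman operator: sum_{k<n} (gamma P)^k r + (gamma P)^n v."""
    out = np.zeros_like(v)
    M = np.eye(len(v))
    for _ in range(n):
        out += M @ r
        M = M @ (gamma * P)
    return out + M @ v

def expected_nstep_td(n, alpha=0.1, iters=5000):
    """Deterministic (model-based) n-step TD recursion on the weights."""
    w = np.zeros(Phi.shape[1])
    for _ in range(iters):
        v = Phi @ w
        # Expected update: move w along the D-weighted n-step TD error.
        w = w + alpha * Phi.T @ D @ (T_n(v, n) - v)
    return w

w = expected_nstep_td(n=10)
print("converged weights:", np.round(w, 4))
```

At the fixed point, the D-weighted n-step TD error is orthogonal to the features, i.e. Phi.T @ D @ (T_n(Phi w, n) - Phi w) = 0; with n = 10 the effective discount gamma^n is about 0.35, small enough for the recursion to converge despite the mismatched weighting D.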