Linear TD($\lambda$) is one of the most fundamental reinforcement learning algorithms for policy evaluation. Previously, convergence rates have typically been established under the assumption of linearly independent features, an assumption that does not hold in many practical scenarios. This paper instead establishes the first $L^2$ convergence rates for linear TD($\lambda$) with arbitrary features, without any algorithmic modification or additional assumptions. Our results apply to both the discounted and average-reward settings. To address the potential non-uniqueness of solutions resulting from arbitrary features, we develop a novel stochastic approximation result featuring convergence rates to the solution set rather than to a single point.
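For reference, the standard linear TD($\lambda$) update in the discounted setting is recalled below; the notation (step sizes $\alpha_t$, feature map $\phi$, weights $w_t$, eligibility trace $z_t$) is ours and may differ from that used in the body of the paper:
\[
z_t = \gamma\lambda\, z_{t-1} + \phi(S_t), \qquad
\delta_t = R_{t+1} + \gamma\, \phi(S_{t+1})^\top w_t - \phi(S_t)^\top w_t, \qquad
w_{t+1} = w_t + \alpha_t\, \delta_t\, z_t .
\]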