Theoretical work on Temporal Difference (TD) learning has provided finite-sample and high-probability guarantees for data generated from Markov chains. However, these bounds typically require linear function approximation, instance-dependent step sizes, algorithmic modifications, and restrictive mixing rates. We present theoretical results for TD learning under more practical assumptions, including instance-independent step sizes, full data utilization, and polynomial ergodicity, covering both linear and non-linear function approximation. \textbf{To our knowledge, this is the first proof of TD(0) convergence on Markov data under universal, instance-independent step sizes.} While each contribution is significant on its own, their combination makes these bounds effectively usable in practical settings. Our results include bounds for linear models and for non-linear models under generalized gradients and H\"older continuity.
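To make the setting concrete, the following is a minimal sketch of TD(0) with linear function approximation on a single Markov-chain trajectory. The toy chain, feature map, and the constant in the $\alpha_t = c/\sqrt{t}$ schedule are hypothetical stand-ins chosen for illustration; the schedule depends on no instance-specific quantities (mixing time, feature conditioning), mirroring the instance-independent step sizes discussed above, and every transition is used for an update (full data utilization).

```python
import numpy as np

# Illustrative TD(0) with linear function approximation on a toy Markov chain.
# The chain (P, r), the feature map Phi, and the step-size constant 0.5 are
# hypothetical; alpha_t = 0.5 / sqrt(t) uses no instance-dependent quantities.

rng = np.random.default_rng(0)

n_states, d, gamma = 5, 3, 0.9
P = rng.dirichlet(np.ones(n_states), size=n_states)  # row-stochastic transition matrix
r = rng.standard_normal(n_states)                    # per-state rewards
Phi = rng.standard_normal((n_states, d))             # features phi(s), one row per state

theta = np.zeros(d)                                  # linear value-function weights
s = 0
for t in range(1, 50_001):
    s_next = rng.choice(n_states, p=P[s])            # single trajectory, no resets
    alpha = 0.5 / np.sqrt(t)                         # instance-independent step size
    td_error = r[s] + gamma * Phi[s_next] @ theta - Phi[s] @ theta
    theta += alpha * td_error * Phi[s]               # TD(0) update on every sample
    s = s_next

print(theta.shape)
```

A non-linear variant would replace `Phi[s] @ theta` with a parametric model $V_\theta(s)$ and `Phi[s]` with its (generalized) gradient in the update.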