We derive a uniform all-time concentration bound of the form 'for all $n \geq n_0$, for some $n_0$' for TD(0) with linear function approximation. We work with online TD learning driven by samples from a single sample path of the underlying Markov chain. This makes our analysis significantly different from that of offline TD learning, or of TD learning with access to independent samples from the stationary distribution of the Markov chain. We treat TD(0) as a contractive stochastic approximation algorithm with both martingale and Markov noise. The Markov noise is handled using the Poisson equation, and the lack of almost sure guarantees on the boundedness of iterates is handled using the concept of relaxed concentration inequalities.
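To fix ideas, the setting described above can be sketched as follows: online TD(0) with linear function approximation, updated along a single sample path of a Markov chain (so the noise at each step depends on the current state, i.e., Markov noise, rather than being i.i.d.). The chain, features, rewards, and step sizes below are illustrative assumptions, not the paper's specific construction.

```python
import numpy as np

rng = np.random.default_rng(0)

# A small ergodic Markov chain (hypothetical example).
P = np.array([[0.9, 0.1],
              [0.2, 0.8]])       # transition matrix
r = np.array([1.0, -1.0])        # per-state reward
phi = np.eye(2)                  # feature vector of each state (tabular here)
gamma = 0.9                      # discount factor

theta = np.zeros(2)              # linear weights to be learned
s = 0                            # start of the single sample path
for n in range(1, 10001):
    s_next = rng.choice(2, p=P[s])
    # TD(0) update along the sample path: the same trajectory drives
    # both the update direction and the state evolution (Markov noise).
    delta = r[s] + gamma * phi[s_next] @ theta - phi[s] @ theta
    alpha = 1.0 / n              # diminishing step size
    theta += alpha * delta * phi[s]
    s = s_next
```

Because the iterates are driven by one unbroken trajectory, there is no independence between successive updates, which is precisely what forces the Poisson-equation treatment of the noise in the analysis above.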