We study the finite-time behaviour of the popular temporal difference (TD) learning algorithm when combined with tail-averaging. We derive finite-time bounds on the parameter error of the tail-averaged TD iterate under a step-size choice that does not require information about the eigenvalues of the matrix underlying the projected TD fixed point. Our analysis shows that tail-averaged TD converges at the optimal $O\left(1/t\right)$ rate, both in expectation and with high probability. In addition, our bounds exhibit a sharper rate of decay for the initial error (bias) than is obtained by averaging all iterates. We also propose and analyse a variant of TD that incorporates regularisation. Our analysis indicates that the regularised version of TD is useful for problems with ill-conditioned features.
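To make the two ideas above concrete, the following is a minimal sketch of TD(0) with linear function approximation in which only the iterates after a burn-in index are averaged. It is an illustration under stated assumptions, not the paper's exact algorithm: the names `tail_averaged_td`, `toy_chain`, `tail_start` and `reg`, the constant step size, the toy two-state chain, and the ridge-style shrinkage used for the regularised variant are all hypothetical choices made here.

```python
import random


def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


def tail_averaged_td(sample_transition, features, discount, step_size,
                     num_steps, tail_start, reg=0.0, seed=0):
    """Run TD(0) with linear features and average iterates from tail_start on.

    reg > 0 adds a shrinkage term (an assumed form of regularised TD).
    """
    rng = random.Random(seed)
    theta = [0.0] * len(features[0])
    tail_sum = [0.0] * len(theta)
    state = 0
    for t in range(num_steps):
        reward, next_state = sample_transition(state, rng)
        phi, phi_next = features[state], features[next_state]
        # Temporal difference error for the observed transition.
        td_error = reward + discount * dot(theta, phi_next) - dot(theta, phi)
        # Semi-gradient step, with optional shrinkage towards zero.
        shrink = 1.0 - step_size * reg
        theta = [shrink * th + step_size * td_error * p
                 for th, p in zip(theta, phi)]
        # Tail averaging: discard the first tail_start iterates.
        if t >= tail_start:
            tail_sum = [s + th for s, th in zip(tail_sum, theta)]
        state = next_state
    return [s / (num_steps - tail_start) for s in tail_sum]


def toy_chain(state, rng):
    """Hypothetical two-state chain: reward 1 in state 0, uniform next state."""
    reward = 1.0 if state == 0 else 0.0
    return reward, rng.randrange(2)


FEATURES = [[1.0, 0.0], [0.0, 1.0]]  # tabular (one-hot) features
plain = tail_averaged_td(toy_chain, FEATURES, 0.5, 0.05, 20000, 10000)
ridge = tail_averaged_td(toy_chain, FEATURES, 0.5, 0.05, 20000, 10000, reg=1.0)
```

For this toy chain with discount 0.5, solving the Bellman equations by hand gives values 1.5 and 0.5, so `plain` should land close to those numbers, while `ridge` is pulled towards zero by the shrinkage term. Setting `tail_start=0` recovers averaging over all iterates, which keeps the early, initialisation-dominated iterates in the average.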