In this paper we consider the problem of obtaining sharp bounds on the performance of temporal difference (TD) methods with linear function approximation for policy evaluation in discounted Markov decision processes. We show that a simple algorithm with a universal, instance-independent step size, combined with Polyak-Ruppert tail averaging, is sufficient to obtain near-optimal variance and bias terms. We also provide the corresponding sample-complexity bounds. Our proof technique is based on refined error bounds for linear stochastic approximation together with a novel stability result for products of the random matrices that arise from the TD-type recurrence.
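To fix ideas, the algorithm described above can be sketched as follows. This is a minimal illustration, not the paper's exact procedure: it runs TD(0) with linear function approximation at a constant step size `alpha` and then returns the Polyak-Ruppert tail average, i.e. the average of only the last fraction of iterates. All names (`td0_tail_averaged`, `tail_frac`) and the choice of a fixed feature stream are assumptions made for this sketch.

```python
import numpy as np

def td0_tail_averaged(phi, rewards, next_phi, gamma=0.99, alpha=0.01, tail_frac=0.5):
    """Illustrative sketch (not the paper's exact algorithm): TD(0) with linear
    function approximation, a constant step size, and Polyak-Ruppert tail
    averaging over the last `tail_frac` fraction of iterates.

    phi[t]      : feature vector of the state visited at time t
    next_phi[t] : feature vector of the successor state at time t
    rewards[t]  : observed reward at time t
    """
    n, d = phi.shape
    theta = np.zeros(d)
    iterates = np.empty((n, d))
    for t in range(n):
        # Semi-gradient TD(0) update on the temporal-difference error
        td_error = rewards[t] + gamma * (next_phi[t] @ theta) - phi[t] @ theta
        theta = theta + alpha * td_error * phi[t]
        iterates[t] = theta
    # Tail averaging: discard the burn-in, average the remaining iterates
    tail_start = int((1.0 - tail_frac) * n)
    return iterates[tail_start:].mean(axis=0)
```

On a trivial one-state chain with scalar feature 1, reward 1, and discount 0.5, the iterates converge to the fixed point 1/(1-0.5) = 2, and the tail average recovers it while smoothing the noise of the individual iterates.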