In this paper, we consider the problem of obtaining sharp bounds on the performance of temporal difference (TD) methods with linear function approximation for policy evaluation in discounted Markov decision processes. We show that a simple algorithm with a universal, instance-independent step size, combined with Polyak-Ruppert tail averaging, is sufficient to obtain near-optimal variance and bias terms. We also provide the corresponding sample complexity bounds. Our proof technique is based on refined error bounds for linear stochastic approximation, together with a novel stability result for the products of random matrices that arise from the TD-type recurrence.
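As an illustration of the kind of procedure the abstract describes, the following is a minimal sketch of TD(0) with linear function approximation, a constant step size, and Polyak-Ruppert tail averaging. It is not the paper's exact algorithm or step-size rule: the four-state chain `P`, the rewards `r`, the feature matrix `Phi`, the step size `alpha = 0.1`, and the choice to average over the second half of the iterates are all assumptions made for this example. The averaged iterate is compared with the TD fixed point, computed directly from the projected Bellman equation.

```python
import numpy as np


def td0_tail_averaged(P, r, Phi, gamma, alpha, n_steps, burn_in, rng):
    """TD(0) with linear features, constant step size, and tail averaging.

    Only the iterates produced at steps t >= burn_in enter the average.
    """
    n_states, d = Phi.shape
    theta = np.zeros(d)
    theta_bar = np.zeros(d)
    count = 0
    s = rng.integers(n_states)
    for t in range(n_steps):
        s_next = rng.choice(n_states, p=P[s])
        # Temporal-difference error for the observed transition (s, r[s], s_next).
        td_error = r[s] + gamma * Phi[s_next] @ theta - Phi[s] @ theta
        theta = theta + alpha * td_error * Phi[s]
        if t >= burn_in:
            # Running mean of the tail iterates.
            count += 1
            theta_bar += (theta - theta_bar) / count
        s = s_next
    return theta_bar


def td_fixed_point(P, r, Phi, gamma):
    """Solve A theta = b, where A = Phi^T D (I - gamma P) Phi and b = Phi^T D r."""
    n = P.shape[0]
    # Stationary distribution: mu^T P = mu^T with sum(mu) = 1.
    lhs = np.vstack([P.T - np.eye(n), np.ones((1, n))])
    rhs = np.zeros(n + 1)
    rhs[-1] = 1.0
    mu = np.linalg.lstsq(lhs, rhs, rcond=None)[0]
    D = np.diag(mu)
    A = Phi.T @ D @ (np.eye(n) - gamma * P) @ Phi
    b = Phi.T @ D @ r
    return np.linalg.solve(A, b)


# Hypothetical four-state Markov reward process (policy already fixed).
P = np.array([
    [0.10, 0.60, 0.20, 0.10],
    [0.30, 0.10, 0.30, 0.30],
    [0.25, 0.25, 0.25, 0.25],
    [0.40, 0.20, 0.20, 0.20],
])
r = np.array([1.0, 2.0, 0.0, 1.0])
c = 1.0 / np.sqrt(2.0)
# Feature vectors with unit Euclidean norm (an assumption of this sketch).
Phi = np.array([[1.0, 0.0], [0.0, 1.0], [c, c], [c, -c]])

gamma = 0.5   # discount factor (illustrative)
alpha = 0.1   # constant step size (illustrative, not the paper's value)
rng = np.random.default_rng(0)

theta_bar = td0_tail_averaged(P, r, Phi, gamma, alpha,
                              n_steps=20000, burn_in=10000, rng=rng)
theta_star = td_fixed_point(P, r, Phi, gamma)
print("tail-averaged estimate:", theta_bar)
print("TD fixed point:        ", theta_star)
print("error norm:            ", np.linalg.norm(theta_bar - theta_star))
```

Discarding the first half of the iterates is meant to reduce the influence of the initial condition (the bias term), while averaging the remaining iterates reduces the fluctuation caused by the constant step size (the variance term).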