This paper is concerned with the problem of policy evaluation with linear function approximation in discounted infinite-horizon Markov decision processes. We investigate the sample complexities required to guarantee a predefined estimation error of the best linear coefficients for two widely used policy evaluation algorithms: the temporal difference (TD) learning algorithm and the two-timescale linear TD with gradient correction (TDC) algorithm. In both the on-policy setting, where observations are generated from the target policy, and the off-policy setting, where samples are drawn from a behavior policy potentially different from the target policy, we establish the first sample complexity bounds with high-probability convergence guarantees that attain the optimal dependence on the tolerance level. We also exhibit an explicit dependence on problem-related quantities, and show in the on-policy setting that our upper bound matches the minimax lower bound in its dependence on crucial problem parameters, including the choice of the feature maps and the problem dimension.
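For concreteness, the classical updates of the two algorithms named above are sketched below; this is the standard formulation from the literature, with the feature map \(\phi\), step sizes \(\alpha_t,\beta_t\), and auxiliary iterate \(w_t\) being assumed notation rather than the paper's own.

% Sketch of the standard linear TD(0) and TDC updates (assumed notation):
% \phi is the feature map, \theta_t the coefficient iterate, \gamma the discount factor.
\begin{align*}
\delta_t &= r_t + \gamma\,\phi(s_{t+1})^\top \theta_t - \phi(s_t)^\top \theta_t
  && \text{(TD error)} \\
\theta_{t+1} &= \theta_t + \alpha_t\,\delta_t\,\phi(s_t)
  && \text{(TD)} \\
\theta_{t+1} &= \theta_t + \alpha_t\big(\delta_t\,\phi(s_t) - \gamma\,\phi(s_{t+1})\,\phi(s_t)^\top w_t\big)
  && \text{(TDC, slow timescale)} \\
w_{t+1} &= w_t + \beta_t\big(\delta_t - \phi(s_t)^\top w_t\big)\,\phi(s_t)
  && \text{(TDC, fast timescale)}
\end{align*}

The "two-timescale" structure refers to the auxiliary iterate \(w_t\) being updated with a larger step size \(\beta_t \gg \alpha_t\), so that it tracks a correction term while \(\theta_t\) evolves slowly.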