This paper is concerned with the problem of policy evaluation with linear function approximation in discounted infinite-horizon Markov decision processes. We investigate the sample complexities required to guarantee a predefined estimation error of the best linear coefficients for two widely used policy evaluation algorithms: the temporal difference (TD) learning algorithm and the two-timescale linear TD with gradient correction (TDC) algorithm. In both the on-policy setting, where observations are generated from the target policy, and the off-policy setting, where samples are drawn from a behavior policy potentially different from the target policy, we establish the first sample complexity bound with a high-probability convergence guarantee that attains the optimal dependence on the tolerance level. We also exhibit an explicit dependence on problem-related quantities, and show in the on-policy setting that our upper bound matches the minimax lower bound in its dependence on crucial problem parameters, including the choice of the feature maps and the problem dimension.
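For readers unfamiliar with the first of the two algorithms, the following is a minimal sketch of on-policy TD(0) with linear function approximation, written in plain Python. It is an illustration only and not necessarily the exact variant analyzed in the paper: the function name `td0_linear`, the constant step size, the starting state, and the two-state toy problem are all assumptions made for this example.

```python
import random


def td0_linear(features, rewards, transition, gamma, step_size, num_steps, seed=0):
    """TD(0) with the linear value approximation V(s) ~ phi(s)^T theta.

    features[s]   : feature vector phi(s) of state s (list of floats)
    rewards[s]    : reward received in state s under the target policy
    transition[s] : next-state probabilities under the target policy
    All names and the constant step size are illustrative assumptions.
    """
    rng = random.Random(seed)
    num_states = len(features)
    theta = [0.0] * len(features[0])
    state = 0  # arbitrary starting state
    for _ in range(num_steps):
        # Sample the next state from the transition row of the current state.
        next_state = rng.choices(range(num_states), weights=transition[state])[0]
        phi, phi_next = features[state], features[next_state]
        value = sum(p * t for p, t in zip(phi, theta))
        value_next = sum(p * t for p, t in zip(phi_next, theta))
        # Temporal-difference error of the observed transition.
        td_error = rewards[state] + gamma * value_next - value
        # Stochastic-approximation step along the feature direction.
        theta = [t + step_size * td_error * p for t, p in zip(theta, phi)]
        state = next_state
    return theta


# Hypothetical toy problem: two states, one feature, uniform transitions.
FEATURES = [[1.0], [0.5]]
REWARDS = [1.0, 0.0]
TRANSITION = [[0.5, 0.5], [0.5, 0.5]]

theta_hat = td0_linear(FEATURES, REWARDS, TRANSITION,
                       gamma=0.5, step_size=0.01, num_steps=20000)
print(theta_hat)
```

In this toy problem the stationary distribution is uniform, so the best linear coefficient (the TD fixed point) solves E[phi(s)(phi(s) - gamma phi(s'))] theta = E[phi(s) r(s)], that is, 0.34375 theta = 0.5, giving theta = 16/11. The estimate returned by the sketch fluctuates around that value, and the size of the fluctuation depends on the step size and the number of samples, which is the kind of trade-off a sample complexity bound quantifies.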