Statistical inference with finite-sample validity for the value function of a given policy in Markov decision processes (MDPs) is crucial for ensuring the reliability of reinforcement learning. Temporal Difference (TD) learning, arguably the most widely used algorithm for policy evaluation, serves as a natural framework for this purpose. In this paper, we study the consistency properties of TD learning with Polyak-Ruppert averaging and linear function approximation, and obtain three significant improvements over existing results. First, we derive a novel sharp high-dimensional probability convergence guarantee that depends explicitly on the asymptotic variance and holds under weak conditions. We further establish refined high-dimensional Berry-Esseen bounds over the class of convex sets that guarantee faster rates than those in the literature. Finally, we propose a plug-in estimator for the asymptotic covariance matrix, designed for efficient online computation. These results enable the construction of confidence regions and simultaneous confidence intervals for the linear parameters of the value function, with guaranteed finite-sample coverage. We demonstrate the applicability of our theoretical findings through numerical experiments.
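For concreteness, the following minimal Python sketch illustrates the pipeline summarized above: TD(0) with linear function approximation, online Polyak-Ruppert averaging, a plug-in estimate of the asymptotic covariance, and coordinatewise confidence intervals for the linear parameters. The toy Markov reward process, the polynomial step size, and this particular online covariance construction are illustrative assumptions, not the exact estimator or conditions analyzed in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# --- Hypothetical toy Markov reward process (illustrative, not from the paper) ---
n_states, d, gamma, T = 10, 4, 0.9, 50_000
P = rng.dirichlet(np.ones(n_states), size=n_states)  # row-stochastic transitions
r = rng.normal(size=n_states)                        # per-state rewards
Phi = rng.normal(size=(n_states, d))                 # feature map: row s is phi(s)

theta = np.zeros(d)           # TD(0) iterate
theta_bar = np.zeros(d)       # Polyak-Ruppert average of the iterates
A_hat = np.zeros((d, d))      # running estimate of A = E[phi(s)(phi(s) - gamma*phi(s'))^T]
Gamma_hat = np.zeros((d, d))  # running estimate of the TD-error noise covariance
s = 0
for t in range(T):
    s_next = rng.choice(n_states, p=P[s])
    phi, phi_next = Phi[s], Phi[s_next]
    delta = r[s] + gamma * phi_next @ theta - phi @ theta  # TD error at the iterate
    theta = theta + (t + 1) ** -0.7 * delta * phi          # polynomial step size (assumed)
    theta_bar += (theta - theta_bar) / (t + 1)             # online Polyak-Ruppert averaging
    # Plug-in ingredients: one plausible online construction (the paper's may differ).
    delta_bar = r[s] + gamma * phi_next @ theta_bar - phi @ theta_bar
    A_hat += (np.outer(phi, phi - gamma * phi_next) - A_hat) / (t + 1)
    Gamma_hat += (delta_bar ** 2 * np.outer(phi, phi) - Gamma_hat) / (t + 1)
    s = s_next

# Plug-in asymptotic covariance: Lambda = A^{-1} Gamma A^{-T}.
A_inv = np.linalg.inv(A_hat)
Lambda_hat = A_inv @ Gamma_hat @ A_inv.T

# Coordinatewise 95% confidence intervals for the averaged iterate; a Bonferroni
# correction (z-quantile at level 1 - 0.05/(2d)) would give simultaneous coverage.
se = np.sqrt(np.diag(Lambda_hat) / T)
for i in range(d):
    print(f"theta[{i}]: {theta_bar[i]:+.4f} +/- {1.96 * se[i]:.4f}")
```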