We investigate the statistical properties of Temporal Difference (TD) learning with Polyak-Ruppert averaging, arguably one of the most widely used algorithms in reinforcement learning, for the task of estimating the parameters of the optimal linear approximation to the value function. Assuming independent samples, we make three theoretical contributions that improve upon the current state-of-the-art results: (i) we establish refined high-dimensional Berry-Esseen bounds over the class of convex sets, achieving faster rates than the best known results; (ii) we propose and analyze a novel, computationally efficient online plug-in estimator of the asymptotic covariance matrix; and (iii) we derive sharper high-probability convergence guarantees that depend explicitly on the asymptotic variance and hold under weaker conditions than those adopted in the literature. These results enable the construction of confidence regions and simultaneous confidence intervals for the linear parameters of the value function approximation, with guaranteed finite-sample coverage. We demonstrate the applicability of our theoretical findings through numerical experiments.
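As context for the algorithm the abstract studies, here is a minimal sketch of linear TD(0) with Polyak-Ruppert (iterate) averaging under i.i.d. sampling. The toy Markov chain, features, step-size schedule, and all variable names are illustrative assumptions, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy setup: a 5-state Markov chain with random features,
# random rewards, and discount factor gamma (all illustrative).
n_states, d, gamma = 5, 3, 0.9
P = rng.dirichlet(np.ones(n_states), size=n_states)  # row-stochastic transitions
r = rng.standard_normal(n_states)                    # reward per state
Phi = rng.standard_normal((n_states, d))             # feature map, one row per state

# Stationary distribution of P, used here to draw i.i.d. state samples.
evals, evecs = np.linalg.eig(P.T)
mu = np.real(evecs[:, np.argmax(np.real(evals))])
mu = mu / mu.sum()

T = 200_000
theta = np.zeros(d)      # raw TD iterate
theta_bar = np.zeros(d)  # Polyak-Ruppert running average
for t in range(1, T + 1):
    s = rng.choice(n_states, p=mu)             # i.i.d. state draw
    s_next = rng.choice(n_states, p=P[s])      # one observed transition
    td_err = r[s] + gamma * Phi[s_next] @ theta - Phi[s] @ theta
    theta += t ** -0.75 * td_err * Phi[s]      # Robbins-Monro step size
    theta_bar += (theta - theta_bar) / t       # online average of iterates

# theta_bar targets the TD fixed point theta*, the solution of A theta* = b
# with A = Phi' diag(mu) (I - gamma P) Phi and b = Phi' diag(mu) r.
A = Phi.T @ np.diag(mu) @ (np.eye(n_states) - gamma * P) @ Phi
b = Phi.T @ (mu * r)
theta_star = np.linalg.solve(A, b)
print(np.linalg.norm(theta_bar - theta_star))
```

The averaged iterate `theta_bar`, rather than the last iterate `theta`, is the quantity whose asymptotic normality underlies the Berry-Esseen bounds and confidence regions described in the abstract.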