Markov chains are fundamental to statistical machine learning, underpinning key methodologies such as Markov Chain Monte Carlo (MCMC) sampling and temporal difference (TD) learning in reinforcement learning (RL). Given their widespread use, it is crucial to establish rigorous probabilistic guarantees on their convergence, uncertainty, and stability. In this work, we develop novel high-dimensional concentration inequalities and Berry-Esseen bounds for vector- and matrix-valued functions of Markov chains, addressing key limitations of existing theoretical tools for handling dependent data. We leverage these results to analyze the TD learning algorithm, a widely used method for policy evaluation in RL. Our analysis yields a sharp high-probability consistency guarantee that matches the asymptotic variance up to logarithmic factors. Furthermore, we establish an $O(T^{-\frac{1}{4}}\log T)$ distributional convergence rate for the Gaussian approximation of the TD estimator, measured in convex distance. These findings provide new insights into statistical inference for RL algorithms, bridging the gap between classical stochastic approximation theory and modern reinforcement learning applications.
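For context, a minimal sketch of the TD update referenced above, in its standard linear-function-approximation form; the notation ($\phi$ for the feature map, $\gamma$ for the discount factor, $\alpha_t$ for the step size) is illustrative and not fixed by the abstract:
\[
\theta_{t+1} \;=\; \theta_t \;+\; \alpha_t \bigl( r_t + \gamma\, \phi(s_{t+1})^\top \theta_t - \phi(s_t)^\top \theta_t \bigr)\, \phi(s_t),
\]
where $(s_t, r_t, s_{t+1})$ is a transition of the Markov chain induced by the evaluated policy. The "TD estimator" whose Gaussian approximation is studied is typically taken to be the Polyak-Ruppert average $\bar{\theta}_T = T^{-1}\sum_{t=1}^{T} \theta_t$ of these iterates.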