Non-asymptotic central limit theorem (CLT) rates play a central role in modern machine learning and operations research. In this paper, we study CLT rates for multivariate dependent data in Wasserstein-$p$ ($W_p$) distance, for general $p\ge 1$. We focus on two fundamental dependence structures that commonly arise in practice: locally dependent sequences and geometrically ergodic Markov chains. In both settings, we establish the first optimal $\mathcal O(n^{-1/2})$ rate in $W_1$, as well as the first $W_p$ ($p\ge 2$) CLT rates under mild moment assumptions, substantially improving the best previously known bounds in these dependent-data regimes. As an application of our optimal $W_1$ rate for locally dependent sequences, we further obtain the first optimal $W_1$-CLT rate for multivariate $U$-statistics. On the technical side, we derive a tractable auxiliary bound for $W_1$ Gaussian approximation errors that is well suited to studying dependent data. For Markov chains, we further prove that the regeneration time of the split chain associated with a geometrically ergodic chain has a geometric tail, without assuming strong aperiodicity or other restrictive conditions. These tools may be of independent interest; they enable our optimal $W_1$ rates and underpin our $W_p$ ($p\ge 2$) results.
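
For concreteness, the standard definition of the Wasserstein-$p$ distance and the form of an optimal $W_1$ rate can be written as follows. This is a sketch under assumed notation: the symbols $\mu$, $\nu$, $\Gamma(\mu,\nu)$, $X_i$, $S_n$, $\Sigma$, and $Z$ do not appear in the abstract, and the centering and normalization shown are an assumption about the setting rather than the paper's exact statement.

$$W_p(\mu,\nu) = \Big(\inf_{\gamma\in\Gamma(\mu,\nu)} \int \|x-y\|^p \, d\gamma(x,y)\Big)^{1/p}, \qquad p\ge 1,$$

where $\Gamma(\mu,\nu)$ denotes the set of couplings of $\mu$ and $\nu$. Writing $S_n = n^{-1/2}\sum_{i=1}^n (X_i - \mathbb E X_i)$ and $Z\sim N(0,\Sigma)$, with $\Sigma$ the (assumed) limiting covariance of $S_n$, an optimal $W_1$ rate takes the form

$$W_1\big(\mathcal L(S_n), \mathcal L(Z)\big) = \mathcal O(n^{-1/2}).$$

Since $W_p \le W_q$ whenever $p\le q$, a bound in $W_p$ with $p\ge 2$ is a stronger statement than a bound in $W_1$ at the same rate.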