Maximum Mean Discrepancy (MMD) is widely used in machine learning and statistics to quantify the distance between two distributions on $p$-dimensional Euclidean space. For fixed dimension $p$, the asymptotic properties of the sample MMD are well understood through the theory of U-statistics. Motivated by the frequent use of MMD tests on data of moderate or high dimension, we investigate the behavior of the sample MMD in the high-dimensional setting and develop a new studentized test statistic. Specifically, we establish central limit theorems for the studentized sample MMD as both the dimension $p$ and the sample sizes $n,m$ diverge to infinity. Our results hold for a wide range of kernels, including the popular Gaussian and Laplacian kernels, and cover energy distance as a special case. We also derive an explicit rate of convergence under mild assumptions; our results suggest that the accuracy of the normal approximation can improve with dimensionality. Additionally, we provide a general theory for power analysis under the alternative hypothesis and show that the proposed test can detect differences between two distributions in the moderately high-dimensional regime. Numerical simulations demonstrate the effectiveness of the proposed test statistic and of the normal approximation.
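
To make the quantity under study concrete, the following is a minimal sketch of the standard unbiased U-statistic estimate of the squared MMD with a Gaussian kernel. It is not the studentized statistic proposed here, whose exact form is not given in this abstract. The function names, the bandwidth choice $\sqrt{p}$, and the simulated data are illustrative assumptions.

```python
import math
import random


def gaussian_kernel(x, y, bandwidth):
    """Gaussian kernel k(x, y) = exp(-||x - y||^2 / (2 * bandwidth^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def mmd2_unbiased(xs, ys, bandwidth):
    """Unbiased U-statistic estimate of squared MMD between samples xs and ys.

    xs and ys are lists of equal-length points; each sample needs >= 2 points.
    """
    n, m = len(xs), len(ys)
    # Within-sample averages exclude the diagonal terms k(x_i, x_i).
    k_xx = sum(
        gaussian_kernel(xs[i], xs[j], bandwidth)
        for i in range(n) for j in range(n) if i != j
    ) / (n * (n - 1))
    k_yy = sum(
        gaussian_kernel(ys[i], ys[j], bandwidth)
        for i in range(m) for j in range(m) if i != j
    ) / (m * (m - 1))
    # Cross-sample average uses all n * m pairs.
    k_xy = sum(
        gaussian_kernel(x, y, bandwidth) for x in xs for y in ys
    ) / (n * m)
    return k_xx + k_yy - 2.0 * k_xy


# Illustrative data (assumed, not from the paper): two Gaussian samples in
# dimension p that differ by a mean shift of 0.5 in every coordinate.
rng = random.Random(0)
p, n, m = 20, 30, 30
sample_x = [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]
sample_y = [[rng.gauss(0.5, 1.0) for _ in range(p)] for _ in range(m)]

# Assumed bandwidth: sqrt(p), so the kernel argument stays of order one as p grows.
estimate = mmd2_unbiased(sample_x, sample_y, bandwidth=math.sqrt(p))
print(f"unbiased MMD^2 estimate: {estimate:.4f}")
```

Because the diagonal terms are excluded, the estimate is unbiased and can be slightly negative when the two samples come from the same distribution. A test built on it therefore needs a reference distribution, for example the normal approximation discussed above.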