We study a kernel-based two-sample test statistic related to the Maximum Mean Discrepancy (MMD) in the manifold data setting, where high-dimensional observations are assumed to lie close to a low-dimensional manifold. We characterize the test level and power in relation to the kernel bandwidth, the number of samples, and the intrinsic dimensionality of the manifold. Specifically, when the data densities $p$ and $q$ are supported on a $d$-dimensional sub-manifold $\mathcal{M}$ embedded in an $m$-dimensional space and are H\"older with order $\beta$ (up to 2), we show that the kernel two-sample test is powerful when the number of samples $n$ is large enough that $\Delta_2 \gtrsim n^{-2\beta/(d+4\beta)}$, where $\Delta_2$ is the squared $L^2$-divergence between $p$ and $q$ on the manifold. We establish a lower bound on the test power for finite but sufficiently large $n$, with the kernel bandwidth parameter $\gamma$ scaling as $n^{-1/(d+4\beta)}$. The analysis extends to manifolds with boundary and to data samples that contain high-dimensional additive noise. Our results indicate that the kernel two-sample test does not suffer from the curse of dimensionality when the data lie on or near a low-dimensional manifold. We validate our theory and the properties of the kernel test for manifold data through a series of numerical experiments.
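As a rough illustration of the kind of test described above, the following Python sketch computes a biased MMD-type statistic with a Gaussian kernel and calibrates it by permutation. It is not the authors' implementation: the Gaussian kernel form, the constant `c` in the bandwidth rule, the permutation calibration, and the circle example (a $d=1$ manifold embedded in $m=20$ dimensions) are all assumptions made for illustration. Only the scaling $\gamma \propto n^{-1/(d+4\beta)}$ comes from the abstract.

```python
import numpy as np


def gaussian_kernel(a, b, gamma):
    """Gaussian kernel matrix exp(-||a_i - b_j||^2 / (2 gamma^2)) (assumed form)."""
    sq = (np.sum(a ** 2, axis=1)[:, None]
          + np.sum(b ** 2, axis=1)[None, :]
          - 2.0 * a @ b.T)
    sq = np.maximum(sq, 0.0)  # guard against tiny negative values from rounding
    return np.exp(-sq / (2.0 * gamma ** 2))


def mmd2_biased(x, y, gamma):
    """Biased (V-statistic) estimate of the squared MMD between samples x and y."""
    return (gaussian_kernel(x, x, gamma).mean()
            + gaussian_kernel(y, y, gamma).mean()
            - 2.0 * gaussian_kernel(x, y, gamma).mean())


def bandwidth(n, d, beta=2.0, c=1.0):
    """Bandwidth gamma = c * n^(-1/(d + 4 beta)); the constant c is hypothetical."""
    return c * n ** (-1.0 / (d + 4.0 * beta))


def permutation_test(x, y, gamma, n_perm=200, alpha=0.05, seed=None):
    """Permutation calibration of the statistic (an assumed calibration choice)."""
    rng = np.random.default_rng(seed)
    z = np.vstack([x, y])
    n, total = len(x), len(x) + len(y)
    k = gaussian_kernel(z, z, gamma)  # computed once, re-indexed per permutation

    def stat(idx):
        ix, iy = idx[:n], idx[n:]
        return (k[np.ix_(ix, ix)].mean()
                + k[np.ix_(iy, iy)].mean()
                - 2.0 * k[np.ix_(ix, iy)].mean())

    observed = stat(np.arange(total))
    null = np.array([stat(rng.permutation(total)) for _ in range(n_perm)])
    p_value = (1 + np.sum(null >= observed)) / (1 + n_perm)
    return observed, p_value, bool(p_value <= alpha)


def embed_circle(theta, m=20, noise=0.0, rng=None):
    """Map angles to the unit circle in the first two of m coordinates, plus noise."""
    rng = np.random.default_rng(rng)
    pts = np.zeros((len(theta), m))
    pts[:, 0], pts[:, 1] = np.cos(theta), np.sin(theta)
    return pts + noise * rng.standard_normal(pts.shape)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, d = 200, 1
    # p: uniform on the circle; q: von Mises density on the circle
    x = embed_circle(rng.uniform(0.0, 2.0 * np.pi, n), noise=0.01, rng=rng)
    y = embed_circle(rng.vonmises(0.0, 2.0, n), noise=0.01, rng=rng)
    gamma = bandwidth(n, d)
    stat, p_value, reject = permutation_test(x, y, gamma, seed=1)
    print(f"gamma={gamma:.3f}  MMD^2={stat:.4f}  p={p_value:.4f}  reject={reject}")
```

Note that the bandwidth rule depends on the intrinsic dimension `d` rather than the ambient dimension `m`, which is the sense in which the abstract says the test avoids the curse of dimensionality.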