We study a kernel-based two-sample test statistic related to the Maximum Mean Discrepancy (MMD) in the manifold data setting, where high-dimensional observations are assumed to lie close to a low-dimensional manifold. We characterize the test level and power in relation to the kernel bandwidth, the number of samples, and the intrinsic dimensionality of the manifold. Specifically, when the data densities $p$ and $q$ are supported on a $d$-dimensional sub-manifold ${M}$ embedded in an $m$-dimensional space and are H\"older with order $\beta$ (up to 2) on ${M}$, we prove a guarantee on the test power whenever the finite sample size $n$ exceeds a threshold depending on $d$, $\beta$, and $\Delta_2$, the squared $L^2$-divergence between $p$ and $q$ on the manifold, and the kernel bandwidth $\gamma$ is properly chosen. For small density departures, we show that, for large $n$, they can be detected by the kernel test when $\Delta_2$ is greater than $n^{-2\beta/(d+4\beta)}$ up to a certain constant and $\gamma$ scales as $n^{-1/(d+4\beta)}$. The analysis extends to cases where the manifold has a boundary and the data samples contain high-dimensional additive noise. Our results indicate that the kernel two-sample test does not suffer from the curse of dimensionality when the data lie on or near a low-dimensional manifold. We validate our theory and the properties of the kernel test for manifold data through a series of numerical experiments.
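As a rough illustration of the kind of test described in the abstract, the following is a minimal sketch and not the authors' implementation. It assumes a Gaussian kernel $\exp(-\|x-y\|^2/(2\gamma^2))$, the biased (V-statistic) estimate of the squared MMD, and a permutation-based threshold; none of these choices is specified in the abstract. The helper names `bandwidth`, `mmd2_biased`, `permutation_test`, and `sample_circle`, as well as the constant `c`, are hypothetical. Only the scaling $\gamma \sim n^{-1/(d+4\beta)}$ is taken from the text.

```python
import numpy as np


def gaussian_kernel(a, b, gamma):
    """Gaussian kernel matrix exp(-||a_i - b_j||^2 / (2 gamma^2)) (assumed kernel)."""
    sq = (
        np.sum(a ** 2, axis=1)[:, None]
        + np.sum(b ** 2, axis=1)[None, :]
        - 2.0 * a @ b.T
    )
    sq = np.maximum(sq, 0.0)  # guard against tiny negative values from round-off
    return np.exp(-sq / (2.0 * gamma ** 2))


def mmd2_biased(x, y, gamma):
    """Biased (V-statistic) estimate of the squared MMD between samples x and y."""
    kxx = gaussian_kernel(x, x, gamma)
    kyy = gaussian_kernel(y, y, gamma)
    kxy = gaussian_kernel(x, y, gamma)
    return kxx.mean() + kyy.mean() - 2.0 * kxy.mean()


def bandwidth(n, d, beta, c=1.0):
    """Bandwidth gamma = c * n^(-1/(d + 4 beta)); the constant c is an assumption."""
    return c * n ** (-1.0 / (d + 4.0 * beta))


def permutation_test(x, y, gamma, num_perm=200, seed=0):
    """Return the observed statistic and a permutation p-value (assumed calibration)."""
    rng = np.random.default_rng(seed)
    observed = mmd2_biased(x, y, gamma)
    pooled = np.vstack([x, y])
    n = len(x)
    count = 0
    for _ in range(num_perm):
        idx = rng.permutation(len(pooled))
        stat = mmd2_biased(pooled[idx[:n]], pooled[idx[n:]], gamma)
        count += int(stat >= observed)
    # The +1 terms keep the p-value strictly positive.
    p_value = (count + 1) / (num_perm + 1)
    return observed, p_value


def sample_circle(theta):
    """Embed angles on the unit circle (d = 1) into R^3 (m = 3) as toy manifold data."""
    theta = np.asarray(theta, dtype=float)
    return np.stack([np.cos(theta), np.sin(theta), np.zeros_like(theta)], axis=1)
```

A possible usage, under the same assumptions, is to draw angles from two different distributions, embed them with `sample_circle`, set `gamma = bandwidth(n, d=1, beta=2.0)`, and call `permutation_test(x, y, gamma)`. The bandwidth here depends on the intrinsic dimension $d$ rather than the ambient dimension $m$, which is the point the abstract emphasizes.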