Are Two Datasets Close Enough With Statistical Significance? A Kernel Distributional Closeness Testing Approach

Are two distributions close to each other with statistical significance? Distribution closeness testing (DCT) formalizes this question by testing whether the two distributions in a pair are at least ε-far apart. Existing DCT methods mainly measure discrepancies between distribution pairs defined on discrete spaces, for example using total variation, which limits their application to complex data such as images. To extend DCT to more types of data, a natural idea is to introduce maximum mean discrepancy (MMD), a powerful measure of discrepancy between complex distributions, into DCT scenarios. However, empirical results indicate that many distribution pairs can share the same MMD value even though the distributions have different norms in the same reproducing kernel Hilbert space (RKHS). Such pairs may differ in finite-sample distinguishability and reflect different practical closeness levels, which makes MMD less informative for DCT. To mitigate this issue, we design a new measure of distributional discrepancy, norm-adaptive MMD (NAMMD), which scales the MMD value using the RKHS norms of the distributions. Based on the asymptotic distribution of NAMMD, we propose NAMMD-based DCT to assess the closeness level of a distribution pair. Theoretically, we prove that NAMMD-based DCT has higher test power than MMD-based DCT while maintaining bounded type-I error. This is further validated by extensive experiments on multiple types of data, including synthetic noise and real images. Our code is available at https://github.com/zhijianzhouml/NAMMD.
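
To make the quantities above concrete, the following is a minimal sketch, not the released implementation, of how the squared MMD and the squared RKHS norms of the two mean embeddings can be estimated from samples with a Gaussian kernel. The function names, the bandwidth default, and in particular the normalization used in `norm_scaled_mmd` are assumptions made for illustration; the exact NAMMD definition and its test procedure are given in the paper and repository.

```python
import numpy as np


def gaussian_gram(a, b, bandwidth=1.0):
    """Gaussian kernel matrix k(x, y) = exp(-||x - y||^2 / (2 * bandwidth^2))."""
    sq = np.sum(a**2, axis=1)[:, None] + np.sum(b**2, axis=1)[None, :] - 2.0 * a @ b.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * bandwidth**2))


def off_diag_mean(gram):
    """Mean of the off-diagonal entries of a square Gram matrix."""
    n = gram.shape[0]
    return (gram.sum() - np.trace(gram)) / (n * (n - 1))


def mmd_and_norms(x, y, bandwidth=1.0):
    """Return estimates (mmd2, norm_p2, norm_q2) from samples x ~ P and y ~ Q."""
    norm_p2 = off_diag_mean(gaussian_gram(x, x, bandwidth))  # estimates ||mu_P||^2
    norm_q2 = off_diag_mean(gaussian_gram(y, y, bandwidth))  # estimates ||mu_Q||^2
    cross = gaussian_gram(x, y, bandwidth).mean()            # estimates <mu_P, mu_Q>
    return norm_p2 + norm_q2 - 2.0 * cross, norm_p2, norm_q2


def norm_scaled_mmd(x, y, bandwidth=1.0):
    """Illustrative norm-scaled discrepancy (assumed form, not taken from the text above)."""
    mmd2, norm_p2, norm_q2 = mmd_and_norms(x, y, bandwidth)
    k_max = 1.0  # for the Gaussian kernel, k(x, x) = 1
    # Assumed scaling: the denominator shrinks as the embedding norms grow.
    return mmd2 / (4.0 * k_max - norm_p2 - norm_q2)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.normal(size=(200, 2))           # sample from P
    y = rng.normal(loc=0.5, size=(200, 2))  # sample from a shifted Q
    print(mmd_and_norms(x, y))
    print(norm_scaled_mmd(x, y))
```

Because the Gaussian kernel is bounded by 1, both norm estimates lie in (0, 1], so the denominator in this sketch stays at or above 2. Two pairs with the same estimated MMD but different embedding norms therefore receive different scaled values, which is the behaviour the abstract motivates.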
