Distribution closeness testing (DCT) assesses whether the two distributions in a pair are at least $\epsilon$-far from each other. Existing DCT methods mainly measure discrepancies between distributions defined on discrete one-dimensional spaces (e.g., using total variation), which limits their applicability to complex data (e.g., images). To extend DCT to more types of data, a natural idea is to introduce maximum mean discrepancy (MMD), a powerful measure of the discrepancy between two complex distributions, into DCT. However, we find that MMD can take the same value for many pairs of distributions that have different norms in the same reproducing kernel Hilbert space (RKHS), which makes MMD less informative when comparing the closeness levels of multiple distribution pairs. To mitigate this issue, we design a new measure of distributional discrepancy, norm-adaptive MMD (NAMMD), which scales MMD's value using the RKHS norms of the distributions. Based on the asymptotic distribution of NAMMD, we propose NAMMD-based DCT to assess the closeness level of a distribution pair. Theoretically, we prove that NAMMD-based DCT has higher test power than MMD-based DCT, with bounded type-I error; this is also validated by extensive experiments on many types of data (e.g., synthetic noise, real images). Furthermore, we apply NAMMD to the two-sample testing problem and find that the NAMMD-based two-sample test has higher test power than the MMD-based two-sample test in both theory and experiments.
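To make the idea of scaling MMD by RKHS norms concrete, the following is a minimal sketch rather than the authors' implementation. It computes plug-in estimates of the squared MMD and of the squared RKHS norms of the two mean embeddings under a Gaussian kernel, then divides the former by a norm-dependent term. The specific normalization used here, $4K - \|\mu_P\|^2 - \|\mu_Q\|^2$ with $K$ an upper bound on the kernel, is an assumption made for illustration; the abstract only states that MMD's value is scaled using the RKHS norms. The function names, the choice of kernel, and the bandwidth are likewise hypothetical.

```python
import numpy as np


def gaussian_kernel(a, b, bandwidth=1.0):
    """Gaussian kernel matrix k(a_i, b_j) = exp(-||a_i - b_j||^2 / (2 * bandwidth^2))."""
    sq_dists = (
        np.sum(a**2, axis=1)[:, None]
        + np.sum(b**2, axis=1)[None, :]
        - 2.0 * a @ b.T
    )
    # Clip tiny negative values caused by floating-point round-off.
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * bandwidth**2))


def mmd_and_norms(x, y, bandwidth=1.0):
    """Plug-in (biased) estimates of MMD^2 and the squared RKHS norms of the mean embeddings."""
    norm_p = gaussian_kernel(x, x, bandwidth).mean()  # estimate of ||mu_P||^2
    norm_q = gaussian_kernel(y, y, bandwidth).mean()  # estimate of ||mu_Q||^2
    cross = gaussian_kernel(x, y, bandwidth).mean()   # estimate of <mu_P, mu_Q>
    mmd2 = norm_p + norm_q - 2.0 * cross
    return mmd2, norm_p, norm_q


def norm_scaled_mmd(x, y, bandwidth=1.0, kernel_bound=1.0):
    """MMD^2 divided by a norm-dependent term.

    ASSUMPTION: the denominator 4K - ||mu_P||^2 - ||mu_Q||^2 is an illustrative
    choice, not a formula given in the abstract. For the Gaussian kernel,
    k(x, x) = 1, so kernel_bound = 1 keeps the denominator at least 2.
    """
    mmd2, norm_p, norm_q = mmd_and_norms(x, y, bandwidth)
    return mmd2 / (4.0 * kernel_bound - norm_p - norm_q)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.normal(size=(200, 2))
    y = rng.normal(loc=1.0, size=(200, 2))
    mmd2, norm_p, norm_q = mmd_and_norms(x, y)
    print("MMD^2:", mmd2, "norms:", norm_p, norm_q)
    print("norm-scaled value:", norm_scaled_mmd(x, y))
```

Under this assumed normalization, two pairs of distributions with the same MMD but different embedding norms receive different scores, which is the behavior the abstract attributes to NAMMD.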