Two-sample hypothesis testing, that is, determining whether two sets of data are drawn from the same distribution, is a fundamental problem in statistics and machine learning with broad scientific applications. In nonparametric testing, the maximum mean discrepancy (MMD) has gained popularity as a test statistic due to its flexibility and strong theoretical foundations. However, its use in large-scale scenarios is hindered by its high computational cost. In this work, we use a Nyström approximation of the MMD to design a computationally efficient and practical testing algorithm while preserving statistical guarantees. Our main result is a finite-sample bound on the power of the proposed test for distributions that are sufficiently separated with respect to the MMD. The derived separation rate matches the known minimax optimal rate in this setting. We support our findings with a series of numerical experiments, emphasizing applicability to realistic scientific data.
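To make the idea concrete, the following is a minimal illustrative sketch of a two-sample test built on a Nyström-approximated MMD. It is not the algorithm analyzed in this work: the Gaussian kernel, the uniform sampling of landmark points from the pooled sample, the permutation-based calibration, and all names and defaults (`nystrom_features`, `nystrom_mmd_permutation_test`, `num_landmarks`, `bandwidth`) are assumptions made only for illustration.

```python
import numpy as np


def gaussian_kernel(a, b, bandwidth):
    """Gaussian kernel matrix between the rows of a and the rows of b."""
    sq = np.sum(a**2, axis=1)[:, None] + np.sum(b**2, axis=1)[None, :] - 2.0 * a @ b.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * bandwidth**2))


def nystrom_features(z, landmarks, bandwidth, rcond=1e-10):
    """Map each row of z to Nystrom features K_nm @ K_mm^{-1/2}.

    Directions of K_mm with negligible eigenvalues are dropped for stability,
    so the feature dimension can be smaller than the number of landmarks.
    """
    k_mm = gaussian_kernel(landmarks, landmarks, bandwidth)
    k_nm = gaussian_kernel(z, landmarks, bandwidth)
    evals, evecs = np.linalg.eigh(k_mm)
    keep = evals > rcond * evals.max()
    inv_sqrt = evecs[:, keep] / np.sqrt(evals[keep])
    return k_nm @ inv_sqrt


def nystrom_mmd_permutation_test(x, y, num_landmarks=20, bandwidth=1.0,
                                 num_permutations=200, seed=0):
    """Return (approximate MMD, permutation p-value) for samples x and y.

    Assumption: landmarks are drawn uniformly from the pooled sample, and the
    null distribution is estimated by permuting sample labels.
    """
    rng = np.random.default_rng(seed)
    z = np.vstack([x, y])
    n = len(x)
    idx = rng.choice(len(z), size=min(num_landmarks, len(z)), replace=False)
    phi = nystrom_features(z, z[idx], bandwidth)  # computed once, reused below

    def stat(order):
        # Distance between the two approximate mean embeddings.
        return np.linalg.norm(phi[order[:n]].mean(axis=0) - phi[order[n:]].mean(axis=0))

    observed = stat(np.arange(len(z)))
    null = np.array([stat(rng.permutation(len(z))) for _ in range(num_permutations)])
    p_value = (1 + np.sum(null >= observed)) / (1 + num_permutations)
    return observed, p_value
```

In this sketch the features are computed once, so each permutation only re-averages an array with one row per pooled sample point and at most `num_landmarks` columns, instead of touching a full pairwise kernel matrix. A call such as `nystrom_mmd_permutation_test(x, y)` on two arrays of shape `(n, d)` and `(m, d)` returns the statistic and a p-value; the number of landmarks and the bandwidth would need to be chosen for the data at hand.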