Kernel two-sample tests are widely used to test equality of distributions for multivariate data. However, existing tests based on mapping distributions into a reproducing kernel Hilbert space mainly target specific alternatives and, because of the curse of dimensionality, perform poorly in some scenarios when the dimension of the data is moderate to high. We propose a new test statistic that exploits a common pattern in moderate and high dimensions and achieves substantial power improvements over existing kernel two-sample tests for a wide range of alternatives. We also propose alternative testing procedures that maintain high power at low computational cost, offering easy off-the-shelf tools for large datasets. The new approaches are compared with other state-of-the-art tests under various settings and show good performance. We showcase the new approaches through two applications: the comparison of musks and non-musks using the shape of molecules, and the comparison of taxi trips starting from John F. Kennedy airport in consecutive months. All proposed methods are implemented in the R package kerTests.
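For readers unfamiliar with the baseline that the abstract refers to, the following is a minimal sketch of a conventional kernel two-sample test: the unbiased squared maximum mean discrepancy (MMD) with a Gaussian kernel, calibrated by permutation. It is not the statistic proposed here and does not reproduce the kerTests API. The function names, the median-heuristic bandwidth, and the permutation count are all assumptions chosen for illustration.

```python
import numpy as np


def squared_distances(Z):
    """Pairwise squared Euclidean distances between the rows of Z."""
    sq_norms = np.sum(Z ** 2, axis=1)
    d2 = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (Z @ Z.T)
    return np.maximum(d2, 0.0)  # clip tiny negatives from round-off


def gaussian_kernel_matrix(Z):
    """Gaussian kernel on the pooled sample; bandwidth set by the median
    heuristic (an assumption of this sketch, not taken from the source)."""
    d2 = squared_distances(Z)
    upper = d2[np.triu_indices(len(Z), k=1)]
    bandwidth_sq = np.median(upper)
    return np.exp(-d2 / (2.0 * bandwidth_sq))


def mmd2_unbiased(K, idx_x, idx_y):
    """Unbiased estimate of squared MMD from a pooled kernel matrix K."""
    m, n = len(idx_x), len(idx_y)
    Kxx = K[np.ix_(idx_x, idx_x)]
    Kyy = K[np.ix_(idx_y, idx_y)]
    Kxy = K[np.ix_(idx_x, idx_y)]
    # Within-sample averages exclude the diagonal (self-similarity) terms.
    term_x = (Kxx.sum() - np.trace(Kxx)) / (m * (m - 1))
    term_y = (Kyy.sum() - np.trace(Kyy)) / (n * (n - 1))
    return term_x + term_y - 2.0 * Kxy.mean()


def mmd_permutation_test(X, Y, n_perm=500, seed=0):
    """Return (observed statistic, permutation p-value) for H0: P_X = P_Y."""
    rng = np.random.default_rng(seed)
    Z = np.vstack([X, Y])
    m, N = len(X), len(X) + len(Y)
    K = gaussian_kernel_matrix(Z)  # computed once, reused for every permutation
    observed = mmd2_unbiased(K, np.arange(m), np.arange(m, N))
    exceed = 0
    for _ in range(n_perm):
        perm = rng.permutation(N)  # random relabelling of the pooled sample
        if mmd2_unbiased(K, perm[:m], perm[m:]) >= observed:
            exceed += 1
    p_value = (exceed + 1) / (n_perm + 1)  # add-one correction keeps p > 0
    return observed, p_value
```

Calling `mmd_permutation_test(X, Y)` on two arrays of shape `(m, d)` and `(n, d)` returns the statistic and a p-value. The sketch builds the full pooled kernel matrix, so its cost grows quadratically with the total sample size.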