Kernel two-sample tests are widely used, and efficient methods for high-dimensional, large-scale data are receiving increasing attention in the big data era. However, existing methods, such as the maximum mean discrepancy (MMD) test and recently proposed kernel-based tests for large-scale data, are computationally intensive, ineffective against some common alternatives in high-dimensional data, or both. In this paper, we propose a new test that exhibits high power across a wide range of alternatives. Furthermore, the new test is more robust to high dimensionality than existing methods and does not require data splitting to optimize the kernel bandwidth or other parameters. Numerical studies demonstrate that the new approach performs well on both synthetic and real-world data.
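For context, the MMD baseline named in the abstract can be sketched as follows. This is not the proposed test, whose construction the abstract does not describe; it is a minimal illustration of a standard quadratic-time MMD permutation test, assuming a Gaussian kernel with the median-heuristic bandwidth. The function names (`sq_dists`, `gaussian_kernel`, `median_bandwidth`, `mmd2_unbiased`, `mmd_permutation_test`) and the defaults `n_perm=200`, `seed=0` are choices made for this sketch. The pooled kernel matrix has size (m + n) x (m + n), which is one source of the computational cost mentioned above.

```python
import numpy as np


def sq_dists(a, b):
    # Pairwise squared Euclidean distances between the rows of a and b.
    sq = np.sum(a**2, axis=1)[:, None] + np.sum(b**2, axis=1)[None, :] - 2.0 * a @ b.T
    return np.maximum(sq, 0.0)  # clip tiny negatives caused by rounding


def gaussian_kernel(a, b, bandwidth):
    # Gaussian (RBF) kernel matrix with the given bandwidth.
    return np.exp(-sq_dists(a, b) / (2.0 * bandwidth**2))


def median_bandwidth(z):
    # Median heuristic (an assumption of this sketch): the median pairwise
    # distance over the pooled sample, taken over distinct pairs only.
    d = np.sqrt(sq_dists(z, z))
    return np.median(d[np.triu_indices(len(z), k=1)])


def mmd2_unbiased(k, m):
    # Unbiased estimate of squared MMD from a pooled kernel matrix k,
    # whose first m rows/columns belong to the first sample.
    n = k.shape[0] - m
    kxx, kyy, kxy = k[:m, :m], k[m:, m:], k[:m, m:]
    term_xx = (kxx.sum() - np.trace(kxx)) / (m * (m - 1))
    term_yy = (kyy.sum() - np.trace(kyy)) / (n * (n - 1))
    return term_xx + term_yy - 2.0 * kxy.mean()


def mmd_permutation_test(x, y, n_perm=200, seed=0):
    # Permutation test: relabel the pooled sample and recompute the statistic.
    rng = np.random.default_rng(seed)
    z = np.vstack([x, y])
    m = len(x)
    k = gaussian_kernel(z, z, median_bandwidth(z))  # computed once, O((m+n)^2)
    observed = mmd2_unbiased(k, m)
    exceed = 0
    for _ in range(n_perm):
        p = rng.permutation(len(z))
        if mmd2_unbiased(k[np.ix_(p, p)], m) >= observed:
            exceed += 1
    p_value = (exceed + 1) / (n_perm + 1)  # add-one correction keeps p > 0
    return observed, p_value


# Usage on synthetic data: two 5-dimensional Gaussians differing in mean.
data_rng = np.random.default_rng(1)
x = data_rng.normal(size=(40, 5))
y = data_rng.normal(loc=1.0, size=(40, 5))
stat, p_value = mmd_permutation_test(x, y)
print(f"MMD^2 estimate: {stat:.4f}, permutation p-value: {p_value:.4f}")
```

A single fixed bandwidth such as the median heuristic avoids data splitting, but it is only one common default and is not claimed here to be what the paper uses.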