Kernel two-sample testing provides a powerful framework for distinguishing any pair of distributions based on $n$ sample points. However, existing kernel tests either run in $n^2$ time or sacrifice undue power to improve runtime. To address these shortcomings, we introduce Compress Then Test (CTT), a new framework for high-powered kernel testing based on sample compression. CTT cheaply approximates an expensive test by compressing each $n$-point sample into a small but provably high-fidelity coreset. For standard kernels and subexponential distributions, CTT inherits the statistical behavior of a quadratic-time test, recovering the same optimal detection boundary, while running in near-linear time. We couple these advances with cheaper permutation testing, justified by new power analyses; improved time-vs.-quality guarantees for low-rank approximation; and a fast aggregation procedure for identifying especially discriminating kernels. In our experiments with real and simulated data, CTT and its extensions provide 20--200x speed-ups over state-of-the-art approximate MMD tests with no loss of power.
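The compress-then-test idea described in the abstract can be pictured with a minimal sketch: shrink each sample to $m \ll n$ points, then run an ordinary quadratic-time MMD permutation test on the small sets, so the kernel work costs on the order of $m^2$ rather than $n^2$. Everything named here is an assumption made for illustration and is not taken from the paper: the function names, the Gaussian kernel with a fixed bandwidth, the biased (V-statistic) MMD estimate, and above all `placeholder_compress`, which uses plain uniform subsampling as a stand-in. Uniform subsampling carries none of the coreset fidelity guarantees the abstract refers to and generally does lose power, so this shows only the shape of the pipeline, not the CTT algorithm.

```python
import numpy as np


def gaussian_kernel(a, b, bandwidth=1.0):
    """Gaussian kernel matrix between the rows of a and the rows of b."""
    sq_dists = (
        np.sum(a**2, axis=1)[:, None]
        + np.sum(b**2, axis=1)[None, :]
        - 2.0 * a @ b.T
    )
    # Clip tiny negative values caused by floating-point rounding.
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * bandwidth**2))


def mmd2_from_gram(gram, in_x):
    """Biased squared MMD: mean(K_xx) + mean(K_yy) - 2 * mean(K_xy)."""
    weights = np.where(in_x, 1.0 / in_x.sum(), -1.0 / (~in_x).sum())
    return float(weights @ gram @ weights)


def placeholder_compress(points, m, rng):
    """ASSUMPTION: uniform subsampling stands in for a real coreset method."""
    idx = rng.choice(len(points), size=m, replace=False)
    return points[idx]


def compress_then_test_sketch(x, y, m=64, n_perm=199, alpha=0.05, seed=0):
    """Hypothetical pipeline: compress both samples, then permutation-test."""
    rng = np.random.default_rng(seed)
    x_small = placeholder_compress(x, m, rng)
    y_small = placeholder_compress(y, m, rng)

    # One Gram matrix on the 2m compressed points is reused for every
    # permutation, so the kernel is evaluated only (2m)^2 times.
    pooled = np.vstack([x_small, y_small])
    gram = gaussian_kernel(pooled, pooled)
    in_x = np.arange(2 * m) < m

    observed = mmd2_from_gram(gram, in_x)
    null_stats = np.array(
        [mmd2_from_gram(gram, rng.permutation(in_x)) for _ in range(n_perm)]
    )

    # Standard permutation p-value with the +1 correction.
    p_value = (1 + np.sum(null_stats >= observed)) / (1 + n_perm)
    return observed, p_value, bool(p_value <= alpha)
```

Calling `compress_then_test_sketch(x, y)` on two arrays of shape `(n, d)` returns the observed statistic, the permutation p-value, and the reject decision at level `alpha`. Swapping `placeholder_compress` for a compression routine with fidelity guarantees is the step where an actual compress-then-test method would differ from this sketch.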