Kernel two-sample testing provides a powerful framework for distinguishing any pair of distributions based on $n$ sample points. However, existing kernel tests either run in $n^2$ time or sacrifice undue power to improve runtime. To address these shortcomings, we introduce Compress Then Test (CTT), a new framework for high-powered kernel testing based on sample compression. CTT cheaply approximates an expensive test by compressing each $n$-point sample into a small but provably high-fidelity coreset. For standard kernels and subexponential distributions, CTT inherits the statistical behavior of a quadratic-time test -- recovering the same optimal detection boundary -- while running in near-linear time. We couple these advances with cheaper permutation testing, justified by new power analyses; improved time-vs.-quality guarantees for low-rank approximation; and a fast aggregation procedure for identifying especially discriminating kernels. In our experiments with real and simulated data, CTT and its extensions provide 20--200x speed-ups over state-of-the-art approximate MMD tests with no loss of power.
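
The compress-then-test pattern described in the abstract can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration, not the method of the paper: the hypothetical `compress` helper uses plain uniform subsampling as a stand-in for a high-fidelity coreset construction, the kernel is a Gaussian with a fixed bandwidth, and the permutation step reshuffles individual coreset points. The sketch therefore carries none of the stated power or runtime guarantees; it only shows where compression enters a quadratic-time MMD permutation test, so that the quadratic cost is paid on $2m$ coreset points rather than on $2n$ sample points.

```python
import numpy as np


def gaussian_kernel(a, b, bandwidth=1.0):
    """Gaussian kernel matrix between the rows of a and the rows of b."""
    sq_dists = (
        np.sum(a**2, axis=1)[:, None]
        + np.sum(b**2, axis=1)[None, :]
        - 2.0 * a @ b.T
    )
    # Clip tiny negative values caused by floating-point cancellation.
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * bandwidth**2))


def mmd2(K, idx_x, idx_y):
    """Biased (V-statistic) squared MMD read off a precomputed kernel matrix."""
    return (
        K[np.ix_(idx_x, idx_x)].mean()
        + K[np.ix_(idx_y, idx_y)].mean()
        - 2.0 * K[np.ix_(idx_x, idx_y)].mean()
    )


def compress(sample, m, rng):
    """Hypothetical placeholder: uniform subsample of size m.

    ASSUMPTION: this stands in for a real coreset construction and does
    not provide any fidelity guarantee.
    """
    idx = rng.choice(len(sample), size=m, replace=False)
    return sample[idx]


def compress_then_test_sketch(x, y, m, n_perm=200, alpha=0.05, seed=0):
    """Compress both samples, then run an MMD permutation test on the coresets."""
    rng = np.random.default_rng(seed)
    cx, cy = compress(x, m, rng), compress(y, m, rng)

    # One (2m x 2m) kernel matrix is reused across all permutations.
    pooled = np.vstack([cx, cy])
    K = gaussian_kernel(pooled, pooled)

    idx = np.arange(2 * m)
    observed = mmd2(K, idx[:m], idx[m:])

    # Null distribution: recompute the statistic under random relabelings.
    null = np.empty(n_perm)
    for b in range(n_perm):
        perm = rng.permutation(2 * m)
        null[b] = mmd2(K, perm[:m], perm[m:])

    # Standard permutation p-value with the +1 correction.
    p_value = (1 + np.sum(null >= observed)) / (1 + n_perm)
    return observed, p_value, bool(p_value <= alpha)


if __name__ == "__main__":
    data_rng = np.random.default_rng(1)
    x = data_rng.normal(0.0, 1.0, size=(2000, 2))
    y = data_rng.normal(0.5, 1.0, size=(2000, 2))
    stat, p, reject = compress_then_test_sketch(x, y, m=64)
    print(f"MMD^2 = {stat:.4f}, p-value = {p:.4f}, reject = {reject}")
```

With $n = 2000$ and $m = 64$, the kernel matrix has $128^2$ entries instead of $4000^2$. Substituting a better compression routine for `compress` changes only the fidelity of the coreset, not the structure of the test.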