We prove a convergence theorem for U-statistics of degree two in which the data dimension $d$ is allowed to scale with the sample size $n$. We find that the limiting distribution of a U-statistic undergoes a phase transition from the non-degenerate Gaussian limit to the degenerate limit, and that this transition occurs regardless of degeneracy and depends only on a moment ratio. A surprising consequence is that a non-degenerate U-statistic in high dimensions can have a non-Gaussian limit with a larger variance and an asymmetric distribution. Our bounds are valid for any finite $n$ and $d$, are independent of the individual eigenvalues of the underlying function, and are dimension-independent under a mild assumption. As an application, we use our theory to study two popular kernel-based distribution tests, MMD and KSD, whose high-dimensional performance has been challenging to analyze. In a simple empirical setting, our results correctly predict how the test power at a fixed threshold scales with $d$ and the bandwidth.
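For concreteness, a degree-two U-statistic averages a symmetric function $u$ over all pairs of distinct observations, $D_n = \frac{1}{n(n-1)} \sum_{i \neq j} u(X_i, X_j)$. The sketch below is only an illustration of that definition, not the construction analyzed in the paper. It uses the standard unbiased MMD estimator written as a one-sample U-statistic on paired observations. The Gaussian kernel, the bandwidth value, the mean-shift data, and all function names (`u_statistic`, `gaussian_kernel`, `mmd_core`, `draw_pairs`) are assumptions made for the example.

```python
import math
import random
from itertools import combinations


def u_statistic(sample, u):
    """Degree-two U-statistic: mean of a symmetric function u over all unordered pairs."""
    n = len(sample)
    if n < 2:
        raise ValueError("need at least two observations")
    total = sum(u(a, b) for a, b in combinations(sample, 2))
    # Averaging over unordered pairs equals (1 / (n (n - 1))) * sum over i != j,
    # because u is symmetric.
    return total / (n * (n - 1) / 2)


def gaussian_kernel(x, y, bandwidth):
    """Gaussian (RBF) kernel between two vectors given as lists of floats."""
    sq_dist = sum((xi - yi) ** 2 for xi, yi in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def mmd_core(bandwidth):
    """Symmetric function whose U-statistic is the unbiased squared-MMD estimate.

    Each observation is a pair z = (x, y), with x drawn from P and y from Q.
    """
    def u(z1, z2):
        (x1, y1), (x2, y2) = z1, z2
        k = lambda a, b: gaussian_kernel(a, b, bandwidth)
        return k(x1, x2) + k(y1, y2) - k(x1, y2) - k(x2, y1)
    return u


def draw_pairs(n, d, shift, rng):
    """Illustrative data (an assumption): P = N(0, I_d), Q = N(shift * 1, I_d)."""
    return [
        ([rng.gauss(0.0, 1.0) for _ in range(d)],
         [rng.gauss(shift, 1.0) for _ in range(d)])
        for _ in range(n)
    ]


if __name__ == "__main__":
    rng = random.Random(0)
    d = 5
    pairs = draw_pairs(n=50, d=d, shift=1.0, rng=rng)
    # Bandwidth sqrt(d) is an arbitrary illustrative choice, not a recommendation.
    estimate = u_statistic(pairs, mmd_core(bandwidth=math.sqrt(d)))
    print(f"unbiased MMD^2 estimate with d={d}: {estimate:.4f}")
```

Rerunning the script with different values of `d` and `bandwidth` gives a rough feel for how the statistic behaves as the dimension grows, although a single draw says nothing about the limiting distribution or the test power.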