U-statistics are a fundamental class of estimators that generalize the sample mean and underpin much of nonparametric statistics. Although they have been studied extensively in both statistics and probability, key challenges remain: their high computational cost, addressed partly through incomplete U-statistics, and their non-standard asymptotic behavior in the degenerate case, which typically requires resampling methods for hypothesis testing. This paper presents a novel perspective on U-statistics, grounded in hypergraph theory and combinatorial designs. Our approach bypasses the traditional Hoeffding decomposition, which is the main analytical tool in this literature but is highly sensitive to degeneracy. By characterizing the dependence structure of a U-statistic, we derive a Berry-Esseen bound valid for incomplete U-statistics of deterministic designs, yielding conditions under which Gaussian limiting distributions can be established even in degenerate cases and when the order diverges. We also introduce efficient algorithms to construct incomplete U-statistics of equireplicate designs, a subclass of deterministic designs that, in certain cases, achieve minimum variance. Finally, we apply our framework to kernel-based tests that use the Maximum Mean Discrepancy (MMD) and the Hilbert-Schmidt Independence Criterion (HSIC). In a real-data example with the CIFAR-10 dataset, our permutation-free MMD test delivers substantial computational gains while retaining power and type I error control.
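As a minimal illustration of the objects discussed above, and not of the construction algorithms introduced in the paper, the following Python sketch contrasts a complete order-2 U-statistic with an incomplete one evaluated on a simple equireplicate design. The variance kernel, the cyclic design, and all function names (`variance_kernel`, `cyclic_design`, and so on) are assumptions chosen for illustration. The cyclic design pairs each index `i` with `(i + d) mod n` for a few offsets `d`, so every observation appears in the same number of pairs.

```python
from collections import Counter
from itertools import combinations


def variance_kernel(x, y):
    """Order-2 kernel whose complete U-statistic is the unbiased sample variance."""
    return 0.5 * (x - y) ** 2


def complete_u_statistic(data, kernel):
    """Average the kernel over all n-choose-2 unordered pairs."""
    pairs = list(combinations(range(len(data)), 2))
    return sum(kernel(data[i], data[j]) for i, j in pairs) / len(pairs)


def cyclic_design(n, offsets):
    """Illustrative equireplicate design (an assumption, not the paper's algorithm).

    For each offset d with 0 < d < n/2, include the pairs {i, (i + d) mod n}.
    Every index then appears in exactly 2 * (number of distinct offsets) pairs.
    """
    design = set()
    for d in set(offsets):
        if not 0 < 2 * d < n:
            raise ValueError("each offset d must satisfy 0 < d < n/2")
        for i in range(n):
            design.add(tuple(sorted((i, (i + d) % n))))
    return sorted(design)


def incomplete_u_statistic(data, kernel, design):
    """Average the kernel over the pairs of the design only."""
    return sum(kernel(data[i], data[j]) for i, j in design) / len(design)


def replication_counts(design):
    """Count how many pairs of the design contain each index."""
    return Counter(index for pair in design for index in pair)


if __name__ == "__main__":
    data = [1.0, 2.0, 4.0, 7.0, 11.0]
    design = cyclic_design(len(data), offsets=[1])
    print("complete:  ", complete_u_statistic(data, variance_kernel))
    print("incomplete:", incomplete_u_statistic(data, variance_kernel, design))
    print("replication:", dict(replication_counts(design)))
```

With `k` offsets the incomplete statistic uses `n * k` kernel evaluations instead of `n * (n - 1) / 2`, which is the kind of computational saving that motivates incomplete U-statistics; whether a given design also attains minimum variance depends on conditions not modeled in this sketch.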