Nonparametric two-sample testing is a classical problem in inferential statistics. Modern two-sample tests, such as the edge count test and its variants, can handle multivariate and non-Euclidean data, but today's massive datasets often exhibit heterogeneity due to latent subpopulations. Applying these tests directly, without accounting for such heterogeneity, may lead to incorrect statistical decisions. We develop a new nonparametric testing procedure that accurately detects differences between two samples in the presence of unknown heterogeneity in the data-generating process. Our framework handles this latent heterogeneity through a composite null hypothesis under which the two samples arise from mixture distributions with identical component distributions but possibly different mixing weights. In this regime, we study the asymptotic behavior of the weighted edge count test statistic and show that it can be effectively recalibrated to detect arbitrary deviations from the composite null. For practical implementation, we propose a Bootstrapped Weighted Edge Count test, which uses a bootstrap-based calibration procedure that can be easily implemented across a wide range of heterogeneous regimes. A comprehensive simulation study and an application to detecting aberrant user behaviors in online games demonstrate the excellent non-asymptotic performance of the proposed test.
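To make the weighted edge count statistic concrete, the following minimal sketch computes it on a k-nearest-neighbour graph of the pooled sample and calibrates it by permuting sample labels. Several choices here are assumptions not taken from the abstract: the graph type, the weights (each within-sample edge count is weighted by the other sample's size, one weighting used in the edge count literature), and all function names. The permutation step calibrates against the simple null of identical distributions only; it is not the bootstrap calibration for the composite null proposed here, and it illustrates the uncorrected baseline that heterogeneity can mislead.

```python
import math
import random


def knn_edges(points, k):
    """Undirected edge set of a k-nearest-neighbour graph on the pooled points."""
    edges = set()
    for i, p in enumerate(points):
        # Rank all other points by Euclidean distance to point i.
        order = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: math.dist(p, points[j]),
        )
        for j in order[:k]:
            edges.add((min(i, j), max(i, j)))
    return edges


def weighted_edge_count(edges, labels):
    """Weighted sum of the within-sample edge counts R1 and R2."""
    n1 = sum(1 for g in labels if g == 0)
    n2 = len(labels) - n1
    r1 = sum(1 for i, j in edges if labels[i] == 0 and labels[j] == 0)
    r2 = sum(1 for i, j in edges if labels[i] == 1 and labels[j] == 1)
    # Assumed weighting: R1 is weighted by (n2 - 1), R2 by (n1 - 1).
    return ((n2 - 1) * r1 + (n1 - 1) * r2) / (n1 + n2 - 2)


def permutation_p_value(sample_x, sample_y, k=5, n_perm=500, seed=0):
    """Observed statistic and a permutation p-value under the simple null."""
    rng = random.Random(seed)
    points = list(sample_x) + list(sample_y)
    labels = [0] * len(sample_x) + [1] * len(sample_y)
    edges = knn_edges(points, k)  # the graph ignores labels, so build it once
    observed = weighted_edge_count(edges, labels)
    exceed = 0
    for _ in range(n_perm):
        shuffled = labels[:]
        rng.shuffle(shuffled)
        # Large values indicate more within-sample edges than expected.
        if weighted_edge_count(edges, shuffled) >= observed:
            exceed += 1
    return observed, (exceed + 1) / (n_perm + 1)
```

Calling `permutation_p_value(x, y)` on two lists of coordinate tuples returns the observed statistic and its permutation p-value. Under the composite null considered here, where mixing weights may differ between samples, this label-permutation reference distribution is no longer appropriate, which is the gap the proposed bootstrap calibration is meant to address.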