Two-sample tests for multivariate data, and especially for non-Euclidean data, remain underexplored. This paper presents a novel test statistic based on a similarity graph constructed on the pooled observations from the two samples. It can be applied to multivariate and non-Euclidean data as long as a dissimilarity measure can be defined on the sample space, a measure that domain experts can usually provide. Existing tests based on a similarity graph lack power for either location or scale alternatives. The new test exploits a common pattern that was previously overlooked and works for both types of alternatives. It exhibits substantial power gains in simulation studies. Its asymptotic permutation null distribution is derived and shown to work well in finite samples, facilitating its application to large data sets. The new test is illustrated with two applications: the assessment of covariate balance in a matched observational study, and the comparison of network data under different conditions.
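The abstract does not state the formula of the statistic, so the following is only a minimal sketch of the general recipe it describes: build a similarity graph on the pooled observations from a dissimilarity matrix, then compare an edge-count summary with its permutation distribution. Several choices here are illustrative assumptions and not the paper's definitions: the minimum spanning tree as the similarity graph, the pair of within-sample edge counts as the summary, the Mahalanobis-type quadratic form, and the Monte Carlo permutations used in place of the asymptotic null distribution. All function names are hypothetical.

```python
import math
import random


def mst_edges(dist):
    """Prim's algorithm on a full dissimilarity matrix; returns n - 1 edges."""
    n = len(dist)
    in_tree = [False] * n
    best = [float("inf")] * n  # cheapest known link from each node to the tree
    parent = [-1] * n
    best[0] = 0.0
    edges = []
    for _ in range(n):
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        if parent[u] >= 0:
            edges.append((parent[u], u))
        for v in range(n):
            if not in_tree[v] and dist[u][v] < best[v]:
                best[v] = dist[u][v]
                parent[v] = u
    return edges


def within_counts(edges, labels):
    """Number of edges joining two points of sample 0, and of sample 1."""
    r1 = sum(1 for i, j in edges if labels[i] == 0 and labels[j] == 0)
    r2 = sum(1 for i, j in edges if labels[i] == 1 and labels[j] == 1)
    return r1, r2


def graph_two_sample_test(dist, labels, n_perm=999, seed=0):
    """Permutation test on the within-sample edge counts (illustrative only)."""
    rng = random.Random(seed)
    edges = mst_edges(dist)
    observed = within_counts(edges, labels)

    # The graph is fixed; only the sample labels are permuted.
    shuffled = list(labels)
    perm = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        perm.append(within_counts(edges, shuffled))

    # Mean and covariance of (R1, R2), estimated from the permutations.
    m1 = sum(r[0] for r in perm) / n_perm
    m2 = sum(r[1] for r in perm) / n_perm
    v11 = sum((r[0] - m1) ** 2 for r in perm) / (n_perm - 1)
    v22 = sum((r[1] - m2) ** 2 for r in perm) / (n_perm - 1)
    v12 = sum((r[0] - m1) * (r[1] - m2) for r in perm) / (n_perm - 1)
    det = v11 * v22 - v12 ** 2  # assumed nonzero; can vanish for tiny samples

    def quad_form(r):
        d1, d2 = r[0] - m1, r[1] - m2
        return (v22 * d1 * d1 - 2 * v12 * d1 * d2 + v11 * d2 * d2) / det

    s_obs = quad_form(observed)
    exceed = sum(1 for r in perm if quad_form(r) >= s_obs)
    p_value = (1 + exceed) / (1 + n_perm)
    return s_obs, p_value


# Usage on synthetic data: two planar samples that differ in location.
data_rng = random.Random(1)
sample_a = [(data_rng.gauss(0, 1), data_rng.gauss(0, 1)) for _ in range(20)]
sample_b = [(data_rng.gauss(10, 1), data_rng.gauss(10, 1)) for _ in range(20)]
pooled = sample_a + sample_b
labels = [0] * len(sample_a) + [1] * len(sample_b)
# Any dissimilarity could be used here; Euclidean distance is just an example.
dist = [[math.dist(p, q) for q in pooled] for p in pooled]
stat, p_value = graph_two_sample_test(dist, labels)
```

Because the procedure only ever reads the dissimilarity matrix, the same sketch applies unchanged to non-Euclidean objects such as networks, provided a suitable dissimilarity between them is supplied.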