The two-sample problem consists of testing whether two independent samples are drawn from the same (unknown) probability distribution. It finds applications in many areas, ranging from clinical trials to data attribute matching. Its study in high dimension has attracted much attention, in particular because data acquisition can involve multiple, often poorly controlled sources, which may produce datasets with strong sampling bias that jeopardizes their statistical analysis. While classic methods relying on a discrepancy measure between empirical versions of the distributions face the curse of dimensionality, we develop an alternative approach based on statistical learning that extends rank tests, which are known to be asymptotically optimal for univariate data when appropriately designed. To overcome the lack of a natural order in high-dimensional spaces, the approach is implemented in two steps. First, each sample is assigned a label and split into two halves, and a bipartite ranking algorithm applied to the first halves learns a preorder on the feature space, defined by a real-valued scoring function. Next, a two-sample homogeneity rank test is applied to the (univariate) scores of the remaining observations. Because it learns to map the data onto the real line in the same way as (any monotone transform of) the likelihood ratio between the original multivariate distributions, the approach is not affected by the dimensionality, ignores ranking model bias issues, and preserves the asymptotic optimality of univariate R-tests, which are capable of detecting small departures from the null assumption. Beyond a theoretical analysis establishing nonasymptotic bounds for the two types of error of the method, based on recent concentration results for two-sample linear R-processes, an extensive experimental study shows that the proposed method outperforms classic ones.
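The two-step procedure described above can be illustrated with a minimal sketch. It is not the authors' implementation: the scoring function here is a plain logistic regression fitted by gradient descent, used as an assumed stand-in for a bipartite ranking algorithm, and the rank test is the Mann-Whitney-Wilcoxon statistic with a normal approximation (no tie correction). All function names (`fit_linear_scorer`, `rank_sum_pvalue`, `two_stage_rank_test`) are hypothetical and chosen for illustration only.

```python
import math
import random


def fit_linear_scorer(X, y, steps=200, lr=0.3):
    """Fit a linear scoring function by logistic-regression gradient descent.

    Assumption: this is a simple stand-in for a bipartite ranking learner.
    """
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(steps):
        gw, gb = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            z = b + sum(wj * xj for wj, xj in zip(w, xi))
            z = max(-30.0, min(30.0, z))  # clamp to keep exp() well behaved
            err = 1.0 / (1.0 + math.exp(-z)) - yi
            for j in range(d):
                gw[j] += err * xi[j]
            gb += err
        w = [wj - lr * g / n for wj, g in zip(w, gw)]
        b -= lr * gb / n
    # Only the ordering induced by the score matters for the rank test.
    return lambda x: sum(wj * xj for wj, xj in zip(w, x))


def rank_sum_pvalue(scores_a, scores_b):
    """Two-sided Mann-Whitney-Wilcoxon p-value (normal approximation)."""
    n, m = len(scores_a), len(scores_b)
    pooled = sorted([(s, 0) for s in scores_a] + [(s, 1) for s in scores_b])
    rank_sum_a, i = 0.0, 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        midrank = (i + 1 + j) / 2.0  # average of ranks i+1, ..., j
        rank_sum_a += midrank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j
    u = rank_sum_a - n * (n + 1) / 2.0
    mean = n * m / 2.0
    sd = math.sqrt(n * m * (n + m + 1) / 12.0)  # variance without tie correction
    z = (u - mean) / sd
    return math.erfc(abs(z) / math.sqrt(2.0))


def two_stage_rank_test(sample_x, sample_y, seed=0):
    """Step 1: learn a scorer on the first halves. Step 2: rank-test the rest."""
    rng = random.Random(seed)
    xs, ys = list(sample_x), list(sample_y)
    rng.shuffle(xs)
    rng.shuffle(ys)
    hx, hy = len(xs) // 2, len(ys) // 2
    train_points = xs[:hx] + ys[:hy]
    train_labels = [1] * hx + [0] * hy
    score = fit_linear_scorer(train_points, train_labels)
    held_out_x = [score(x) for x in xs[hx:]]
    held_out_y = [score(y) for y in ys[hy:]]
    return rank_sum_pvalue(held_out_x, held_out_y)


if __name__ == "__main__":
    rng = random.Random(42)
    dim = 5
    # Synthetic data: the second sample is shifted by 1 in every coordinate.
    sample_x = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(200)]
    sample_y = [[rng.gauss(1.0, 1.0) for _ in range(dim)] for _ in range(200)]
    print("p-value:", two_stage_rank_test(sample_x, sample_y))
```

Splitting the data keeps the observations used to learn the scoring function separate from those used in the rank test, so the scores being tested are not also the ones the scorer was fitted on.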