The two-sample problem, which consists in testing whether independent samples on $\mathbb{R}^d$ are drawn from the same (unknown) distribution, finds applications in many areas. Its study in high dimension receives much attention, especially because data acquisition in the Big Data era often involves several poorly controlled sources, leading to datasets that may exhibit strong sampling bias. Classic methods, which rely on computing a discrepancy measure between the empirical distributions, face the curse of dimensionality. We develop an alternative approach based on statistical learning that extends rank tests, which, when appropriately designed, are capable of detecting small departures from the null hypothesis in the univariate case. To overcome the lack of a natural order on $\mathbb{R}^d$ when $d\geq 2$, the approach is implemented in two steps. Each sample is assigned a label (positive vs. negative) and divided into two parts. A preorder on $\mathbb{R}^d$, defined by a real-valued scoring function, is first learned by applying a bipartite ranking algorithm to the first part; a rank test is then applied to the scores of the remaining observations to detect possible differences in distribution. Because it learns to project the data onto the real line nearly as (any monotone transform of) the likelihood ratio between the original multivariate distributions would, the approach is little affected by the dimensionality, ranking model bias issues aside, and preserves the advantages of univariate rank tests. Nonasymptotic error bounds are proved using recent concentration results for two-sample linear rank processes, and an experimental study shows that the proposed approach outperforms alternative methods that stand as natural competitors.
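To make the two-step procedure concrete, here is a minimal sketch in Python using only the standard library. It is an illustration under stated assumptions, not the implementation studied in the paper: the scoring function is a crude linear rule (projection onto the difference of the class means) standing in for a genuine bipartite ranking algorithm, the rank test is a one-sided Mann-Whitney-Wilcoxon test with a normal approximation, and all function names and the `train_fraction` parameter are hypothetical.

```python
import math
import random


def fit_linear_scorer(pos, neg):
    """Crude stand-in for a bipartite ranking algorithm (an assumption):
    score x by <w, x>, with w the difference of the two class means."""
    d = len(pos[0])
    mean_pos = [sum(x[j] for x in pos) / len(pos) for j in range(d)]
    mean_neg = [sum(x[j] for x in neg) / len(neg) for j in range(d)]
    w = [a - b for a, b in zip(mean_pos, mean_neg)]
    return lambda x: sum(wj * xj for wj, xj in zip(w, x))


def mann_whitney_pvalue(scores_pos, scores_neg):
    """One-sided Mann-Whitney-Wilcoxon test (normal approximation, no tie
    correction): the p-value is small when positive scores tend to be larger."""
    n, m = len(scores_pos), len(scores_neg)
    # U counts the pairs where the positive score exceeds the negative one.
    u = sum((sp > sn) + 0.5 * (sp == sn)
            for sp in scores_pos for sn in scores_neg)
    mean_u = n * m / 2.0
    std_u = math.sqrt(n * m * (n + m + 1) / 12.0)
    z = (u - mean_u) / std_u
    return 0.5 * math.erfc(z / math.sqrt(2.0))  # P(Z >= z), Z standard normal


def ranking_two_sample_test(sample_x, sample_y, train_fraction=0.5):
    """Step 1: learn a scorer on the first part of each sample.
    Step 2: apply the rank test to the scores of the held-out part."""
    kx = int(len(sample_x) * train_fraction)
    ky = int(len(sample_y) * train_fraction)
    scorer = fit_linear_scorer(sample_x[:kx], sample_y[:ky])
    held_x = [scorer(x) for x in sample_x[kx:]]
    held_y = [scorer(y) for y in sample_y[ky:]]
    return mann_whitney_pvalue(held_x, held_y)


# Usage on synthetic data: two Gaussian samples in dimension 10 whose
# means differ by 0.5 in every coordinate.
rng = random.Random(0)
dim = 10
sample_x = [[rng.gauss(0.5, 1.0) for _ in range(dim)] for _ in range(400)]
sample_y = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(400)]
p_value = ranking_two_sample_test(sample_x, sample_y)
```

The one-sided alternative is a design choice of this sketch: since the scorer is trained to rank positive observations above negative ones, a departure from the null is expected to show up as larger held-out scores for the positive sample. Because the scorer is fitted on observations disjoint from those used in the test, the held-out scores are exchangeable under the null, whatever scorer was learned.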