Two-sample hypothesis testing is a fundamental problem with wide-ranging applications, and it faces new challenges in high-dimensional settings. To mitigate the curse of dimensionality, high-dimensional data are typically assumed to lie on a low-dimensional manifold. To incorporate the geometric information in the data, we propose to apply the Delaunay triangulation and develop the Delaunay weight to measure the geometric proximity among data points. In contrast to existing similarity measures that use only pairwise distances, the Delaunay weight takes both distance and direction information into account. We develop a detailed computation procedure to learn the unknown manifold and approximate the Delaunay weight. We further propose a novel nonparametric test statistic based on the Delaunay weight matrix, and we establish its asymptotic normality under the null hypothesis and its consistency under the alternative. On simulated data, the new test is robust to the learning of the unknown manifold and exhibits substantial power gain when the distributions differ in direction. The proposed test also shows high power on a real dataset of mouse protein expression levels.