Rank-based approaches are among the most popular nonparametric methods for univariate data, used for statistical problems such as hypothesis testing because of their robustness and effectiveness. However, they are unsatisfactory for more complex data. In the era of big data, high-dimensional and non-Euclidean data, such as networks and images, are ubiquitous and pose challenges for statistical analysis. Existing multivariate ranks, such as component-wise, spatial, and depth-based ranks, do not apply to non-Euclidean data and have limited performance on high-dimensional data. Instead of ranking the observations themselves, we propose two types of ranks, applicable to complex data, that are based on a similarity graph constructed on the observations: a graph-induced rank defined by the inductive nature of the graph and an overall rank defined by the weight of edges in the graph. To illustrate their use, both new ranks are used to construct test statistics for two-sample hypothesis testing; under the permutation null distribution and mild conditions on the ranks, these statistics converge to the $\chi_2^2$ distribution, enabling easy type-I error control. Simulation studies show that the new method exhibits good power under a wide range of alternatives compared to existing methods. The new test is illustrated on the New York City taxi data, for comparing travel patterns in consecutive months, and on a brain network dataset, for comparing male and female subjects.
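The abstract does not give the rank definitions or the test statistic in closed form, so the following is only a minimal sketch of the general idea under stated assumptions. It assumes a k-nearest-neighbor similarity graph on Euclidean distances, assigns each edge a rank of k for the nearest neighbor down to 1 for the k-th (one plausible reading of a graph-induced rank), and combines the two within-sample rank sums in a quadratic form whose centering and covariance are estimated from permutations rather than from analytic formulas. The function names `knn_rank_matrix`, `within_sample_sums`, and `rank_graph_permutation_test` are hypothetical and are not taken from the paper.

```python
import numpy as np


def knn_rank_matrix(Z, k):
    """Symmetric rank matrix from a k-NN graph (assumed construction).

    The edge from i to its l-th nearest neighbour gets rank k - l + 1,
    so closer neighbours receive larger ranks; non-edges get 0.
    """
    n = Z.shape[0]
    # Pairwise Euclidean distances; any dissimilarity could be used instead.
    D = np.linalg.norm(Z[:, None, :] - Z[None, :, :], axis=-1)
    np.fill_diagonal(D, np.inf)  # an observation is not its own neighbour
    neighbours = np.argsort(D, axis=1)[:, :k]
    R = np.zeros((n, n))
    for l in range(k):
        R[np.arange(n), neighbours[:, l]] = k - l
    return (R + R.T) / 2.0  # symmetrise the directed ranks


def within_sample_sums(R, labels):
    """Sum of ranks over pairs inside sample 0 and inside sample 1."""
    x = labels == 0
    y = ~x
    return np.array([R[np.ix_(x, x)].sum(), R[np.ix_(y, y)].sum()])


def rank_graph_permutation_test(X, Y, k=5, n_perm=500, seed=0):
    """Permutation two-sample test on within-sample rank sums (a sketch)."""
    rng = np.random.default_rng(seed)
    Z = np.vstack([X, Y])
    labels = np.r_[np.zeros(len(X), dtype=int), np.ones(len(Y), dtype=int)]
    R = knn_rank_matrix(Z, k)

    observed = within_sample_sums(R, labels)
    # Permuting the labels leaves the graph, and hence R, unchanged.
    perm = np.array(
        [within_sample_sums(R, rng.permutation(labels)) for _ in range(n_perm)]
    )

    # Assumption: mean and covariance are estimated from the permutations.
    mu = perm.mean(axis=0)
    cov_inv = np.linalg.pinv(np.cov(perm, rowvar=False))

    def quad(u):
        d = u - mu
        return float(d @ cov_inv @ d)

    stat = quad(observed)
    perm_stats = np.array([quad(u) for u in perm])
    p_value = (1 + np.sum(perm_stats >= stat)) / (1 + n_perm)
    return stat, p_value
```

Calling `rank_graph_permutation_test(X, Y)` on two arrays of shape `(n_samples, n_features)` returns the statistic and a permutation p-value. For non-Euclidean data such as networks, the distance computation inside `knn_rank_matrix` would need to be replaced by a suitable dissimilarity; the rest of the sketch depends only on the rank matrix.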