Comparing the distributions of $K$ samples is a fundamental problem in data science that arises in a wide variety of fields and applications. In this article, we introduce a maximum-of-differences approach to make such comparisons. Specifically, we first calculate the pairwise distances among the pooled observations of the $K$ samples. We then define two observations as connected if their distance is less than a pre-specified threshold value. For each observation, we next calculate the ``within'' and ``between'' connection probabilities, i.e., the probability that it is connected with other observations in the same sample and the probability that it is connected with observations in the other samples. Subsequently, we propose a maximum-of-differences (MOD) test whose statistic is the maximum, over all observations, of the standardized squared differences between the ``within'' and ``between'' probabilities. The proposed test is not only applicable to multivariate data with $K$ samples, but can also be extended to multivariate regression models. Furthermore, we obtain a covariance-adjusted (CA) version of the MOD test (CA-MOD), whose statistic converges to the Type I extreme value distribution under some conditions. Moreover, we establish the asymptotic properties of the two tests under both the null and alternative hypotheses. The performance and usefulness of the tests are illustrated via simulation studies and real examples.
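To make the construction concrete, the following is a minimal Python sketch of the steps described above: pairwise distances, thresholded connections, per-observation ``within'' and ``between'' proportions, and the maximum of the standardized squared differences. Several pieces are assumptions made for illustration and are not taken from the article: the Euclidean distance, the binomial-type variance used for standardization (which ignores dependence between connections), the function names `mod_statistic` and `permutation_pvalue`, and the use of a permutation p-value in place of the asymptotic extreme value calibration. It also assumes that every sample contains at least two observations.

```python
import numpy as np


def mod_statistic(X, labels, threshold):
    """Illustrative maximum-of-differences statistic (assumed form).

    X: (N, d) array of pooled observations; labels: length-N sample labels.
    """
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    n = X.shape[0]

    # Pairwise Euclidean distances among the pooled observations (assumed metric).
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))

    # Two distinct observations are connected if their distance is below the threshold.
    connected = dist < threshold
    np.fill_diagonal(connected, False)

    # Masks for pairs in the same sample (excluding self) and in different samples.
    same = labels[:, None] == labels[None, :]
    np.fill_diagonal(same, False)
    other = labels[:, None] != labels[None, :]

    n_within = same.sum(axis=1)
    n_between = other.sum(axis=1)

    # Empirical "within" and "between" connection proportions per observation.
    p_within = (connected & same).sum(axis=1) / n_within
    p_between = (connected & other).sum(axis=1) / n_between

    # Assumed standardization: pooled binomial variance of the difference.
    p_pooled = connected.sum(axis=1) / (n - 1)
    var = p_pooled * (1.0 - p_pooled) * (1.0 / n_within + 1.0 / n_between)

    with np.errstate(divide="ignore", invalid="ignore"):
        z2 = np.where(var > 0, (p_within - p_between) ** 2 / var, 0.0)

    # The statistic is the maximum standardized squared difference.
    return float(z2.max())


def permutation_pvalue(X, labels, threshold, n_perm=200, seed=0):
    """Permutation p-value (an assumed calibration, not the article's asymptotic one)."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    observed = mod_statistic(X, labels, threshold)
    exceed = sum(
        mod_statistic(X, rng.permutation(labels), threshold) >= observed
        for _ in range(n_perm)
    )
    return (1 + exceed) / (1 + n_perm)


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    # Three hypothetical samples in two dimensions; the third is shifted.
    X = np.vstack([
        rng.normal(0.0, 1.0, size=(30, 2)),
        rng.normal(0.0, 1.0, size=(30, 2)),
        rng.normal(1.5, 1.0, size=(30, 2)),
    ])
    labels = np.repeat([0, 1, 2], 30)
    print(mod_statistic(X, labels, threshold=1.0))
    print(permutation_pvalue(X, labels, threshold=1.0))
```

Taking the maximum over observations, rather than a sum or average, makes the statistic sensitive to a small group of observations whose neighborhoods are dominated by their own sample. The threshold is left as a user-supplied parameter here because the abstract only states that it is pre-specified.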