Quantifying the similarity of two or more datasets has widespread applications in statistics and machine learning. Choosing a method is difficult, however, because many methods have been proposed and neutral comparison studies are lacking, especially for categorical data. Here, the most promising methods are compared with respect to their ability to detect certain differences between datasets and their resource consumption. The results show that the edge count tests perform well when comparing two datasets (i.e., the two-sample case). For certain scenarios, the constrained minimum (CM) distance performs even better. For categorical data consisting of variables with five categories each, the best method depends on the type of difference between the distributions: either the CM distance and certain graph-based tests perform best, or the classifier-based tests (C2ST) do. This tendency is even clearer for multiple datasets. Overall, the Friedman-Rafsky test can be recommended for two samples as a compromise between high performance, acceptable resource consumption, and the rate of computational errors. For the multi-sample case, the Multi-Sample Mahalanobis Cross-Match (MMCM) test can be recommended due to its comparatively good performance and low resource consumption.
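For intuition about the recommended two-sample method, the following is a minimal sketch of the idea behind the Friedman-Rafsky test: pool both samples, build a minimum spanning tree, and count the edges that join points from different samples. Unusually few such edges suggest that the datasets differ. The sketch rests on several assumptions that are not taken from the study: Euclidean distances, a permutation p-value in place of the usual asymptotic normal approximation, and continuous data without duplicate points. The function name `friedman_rafsky_test` and its parameters are illustrative only.

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree
from scipy.spatial.distance import pdist, squareform


def friedman_rafsky_test(x, y, n_perm=999, seed=0):
    """Permutation sketch of the Friedman-Rafsky two-sample test.

    Returns the number of MST edges joining the two samples and a
    one-sided permutation p-value (small values indicate a difference).
    """
    pooled = np.vstack([x, y])
    labels = np.r_[np.zeros(len(x), dtype=int), np.ones(len(y), dtype=int)]

    # Minimum spanning tree of the pooled sample under Euclidean distance.
    # Assumption: no duplicate points, because SciPy treats zero entries
    # of a dense distance matrix as missing edges.
    dist = squareform(pdist(pooled))
    mst = minimum_spanning_tree(dist).tocoo()
    rows, cols = mst.row, mst.col

    def cross_edges(lab):
        # Number of tree edges whose endpoints come from different samples.
        return int(np.sum(lab[rows] != lab[cols]))

    observed = cross_edges(labels)

    # The tree does not depend on the labels, so only the labels are
    # permuted to obtain the null distribution of the edge count.
    rng = np.random.default_rng(seed)
    perm_counts = np.array(
        [cross_edges(rng.permutation(labels)) for _ in range(n_perm)]
    )

    # Few cross-sample edges are evidence against equal distributions.
    p_value = (1 + np.sum(perm_counts <= observed)) / (1 + n_perm)
    return observed, float(p_value)
```

Called as `friedman_rafsky_test(x, y)` with two arrays of shape `(n_samples, n_features)`, the function returns the observed cross-sample edge count and the p-value. Categorical data would require a different dissimilarity measure and explicit handling of tied distances, which this sketch does not attempt.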