An Empirical Comparison of Methods for Quantifying the Similarity of Categorical Datasets

Quantifying the similarity of two or more datasets has widespread applications in statistics and machine learning. Choosing a method is difficult, however, because many methods have been proposed and neutral comparison studies are lacking, especially for categorical data. Here, the most promising methods are compared with respect to their ability to detect certain differences between datasets and their resource consumption. The results show that the edge count tests perform well when comparing two datasets (i.e., the two-sample case). For certain scenarios, the constrained minimum (CM) distance performs even better. For categorical data consisting of variables with five categories each, the best method depends on the type of difference between the distributions: either the CM distance and certain graph-based tests perform best, or the classifier-based tests (C2ST) do. This tendency is even clearer for multiple datasets. Overall, the Friedman-Rafsky test can be recommended for two samples as a compromise between high performance, acceptable resource consumption, and the occurrence of computational errors. For the multi-sample case, the Multi-Sample Mahalanobis Cross-Match (MMCM) test can be recommended due to its comparatively good performance and low resource consumption.
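To make the recommended two-sample method more concrete, the following is a minimal sketch of a Friedman-Rafsky-style test for categorical records. It is not the implementation evaluated in the study. It assumes Hamming distance as the dissimilarity between records and uses a permutation null distribution instead of the usual normal approximation; both choices, and all function names (`hamming`, `mst_edges`, `cross_edge_count`, `friedman_rafsky`), are illustrative assumptions. The idea is to pool both samples, build a minimum spanning tree, and count the edges that join records from different samples: unusually few such edges suggest that the samples come from different distributions.

```python
import random


def hamming(a, b):
    """Number of positions at which two categorical records differ."""
    return sum(x != y for x, y in zip(a, b))


def mst_edges(points):
    """Prim's algorithm on the complete Hamming-distance graph.

    Returns the n - 1 edges of one minimum spanning tree. With categorical
    data, ties are common, so this tree is generally not unique.
    """
    n = len(points)
    if n == 0:
        return []
    in_tree = [False] * n
    best_dist = [float("inf")] * n
    best_from = [-1] * n
    best_dist[0] = 0
    edges = []
    for _ in range(n):
        # Pick the vertex outside the tree that is closest to the tree.
        u = min((i for i in range(n) if not in_tree[i]),
                key=lambda i: best_dist[i])
        in_tree[u] = True
        if best_from[u] >= 0:
            edges.append((best_from[u], u))
        for v in range(n):
            if not in_tree[v]:
                d = hamming(points[u], points[v])
                if d < best_dist[v]:
                    best_dist[v] = d
                    best_from[v] = u
    return edges


def cross_edge_count(edges, labels):
    """Count tree edges whose endpoints carry different sample labels."""
    return sum(labels[i] != labels[j] for i, j in edges)


def friedman_rafsky(sample_x, sample_y, n_perm=200, seed=0):
    """Return (observed cross-sample edge count, permutation p-value).

    Assumption: the p-value is estimated by shuffling the sample labels on
    the fixed tree; small counts are treated as evidence of a difference.
    """
    rng = random.Random(seed)
    pooled = list(sample_x) + list(sample_y)
    labels = [0] * len(sample_x) + [1] * len(sample_y)
    edges = mst_edges(pooled)
    observed = cross_edge_count(edges, labels)
    hits = 0
    for _ in range(n_perm):
        perm = labels[:]
        rng.shuffle(perm)
        if cross_edge_count(edges, perm) <= observed:
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)


if __name__ == "__main__":
    rng = random.Random(1)
    # Two hypothetical samples: 4 variables with five categories (0..4) each.
    x = [tuple(rng.randrange(5) for _ in range(4)) for _ in range(30)]
    y = [tuple(rng.randrange(5) for _ in range(4)) for _ in range(30)]
    print(friedman_rafsky(x, y))
```

Because Hamming distances on categorical data produce many ties, the spanning tree found here is only one of several possible trees, and the result can depend on the order of the records. Handling such ties is one motivation for the more general edge count tests.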

