Quantifying the similarity between datasets has widespread applications in statistics and machine learning. The performance of a predictive model on novel datasets, referred to as generalizability, depends on how similar the training and evaluation datasets are. Exploiting or transferring insights between similar datasets is a key aspect of meta-learning and transfer learning. In simulation studies, the similarity between the distributions of simulated datasets and the real datasets on which the performance of methods is assessed is crucial. In two- or $k$-sample testing, one checks whether the underlying distributions of two or more datasets coincide. A multitude of approaches for quantifying dataset similarity has been proposed in the literature. We examine 118 such methods and provide a taxonomy classifying them into ten classes. In an extensive review of these methods, we introduce their main underlying ideas, formal definitions, and important properties. We compare the 118 methods in terms of their applicability, interpretability, and theoretical properties, in order to provide recommendations for selecting an appropriate dataset similarity measure based on the specific goal of the dataset comparison and on the properties of the datasets at hand. An online tool facilitates the choice of the appropriate dataset similarity measure.
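To make the notion of a dataset similarity measure concrete, the following is a minimal sketch of one well-known example from the kernel-based family of methods: a biased estimator of the squared maximum mean discrepancy (MMD) with a Gaussian kernel. The function name, bandwidth choice, and sample sizes are illustrative, not taken from the paper.

```python
import numpy as np

def mmd2(x, y, bandwidth=1.0):
    """Biased estimate of the squared maximum mean discrepancy between
    two 1-D samples x and y under a Gaussian kernel. Smaller values
    suggest the datasets come from more similar distributions."""
    def k(a, b):
        d = a[:, None] - b[None, :]          # pairwise differences
        return np.exp(-d**2 / (2 * bandwidth**2))
    return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()

rng = np.random.default_rng(0)
same = mmd2(rng.normal(size=500), rng.normal(size=500))          # same distribution
diff = mmd2(rng.normal(size=500), rng.normal(3.0, 1.0, size=500))  # shifted mean
print(same < diff)  # similar datasets yield a smaller discrepancy
```

In practice, such a statistic is often combined with a permutation test to decide whether an observed discrepancy is large enough to reject the hypothesis that the two datasets share an underlying distribution.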