In statistics and machine learning, measuring the similarity between two or more datasets is important for several purposes. The performance of a predictive model on novel datasets, referred to as generalizability, critically depends on how similar the dataset used for fitting the model is to the novel datasets. Exploiting or transferring insights between similar datasets is a key aspect of meta-learning and transfer learning. Two-sample testing checks whether the underlying (multivariate) distributions of two datasets coincide. A great many approaches for quantifying dataset similarity have been proposed in the literature, and a structured overview is a crucial first step for comparing them. We examine more than 100 methods and provide a taxonomy that classifies them into ten classes: (i) comparisons of cumulative distribution functions, density functions, or characteristic functions, (ii) methods based on multivariate ranks, (iii) discrepancy measures for distributions, (iv) graph-based methods, (v) methods based on inter-point distances, (vi) kernel-based methods, (vii) methods based on binary classification, (viii) distance and similarity measures for datasets, (ix) comparisons based on summary statistics, and (x) different testing approaches. Here, we present an extensive review of these methods, introducing the main underlying ideas, formal definitions, and important properties.
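To make the two-sample testing setting concrete, the following is a minimal sketch of one method from class (v), a permutation test based on inter-point distances using the energy distance. It is an illustration under stated assumptions, not an implementation taken from the review: the function names `energy_statistic` and `permutation_test`, the V-statistic form of the estimator, and the default number of permutations are all choices made here for the example.

```python
import numpy as np


def energy_statistic(x, y):
    """V-statistic estimate of the energy distance 2 E|X-Y| - E|X-X'| - E|Y-Y'|.

    x and y are arrays of shape (n_samples, n_features).
    """
    def mean_dist(a, b):
        # Mean Euclidean distance over all pairs (a_i, b_j).
        diff = a[:, None, :] - b[None, :, :]
        return np.sqrt((diff ** 2).sum(axis=-1)).mean()

    return 2.0 * mean_dist(x, y) - mean_dist(x, x) - mean_dist(y, y)


def permutation_test(x, y, n_permutations=500, seed=0):
    """Approximate p-value for H0: both samples come from the same distribution.

    Hypothetical helper for illustration; labels are reshuffled on the pooled sample.
    """
    rng = np.random.default_rng(seed)
    observed = energy_statistic(x, y)
    pooled = np.vstack([x, y])
    n = len(x)
    count = 0
    for _ in range(n_permutations):
        perm = rng.permutation(len(pooled))
        stat = energy_statistic(pooled[perm[:n]], pooled[perm[n:]])
        if stat >= observed:
            count += 1
    # Add-one correction keeps the p-value strictly positive.
    p_value = (count + 1) / (n_permutations + 1)
    return observed, p_value
```

Calling `permutation_test(x, y)` on two arrays with the same number of columns returns the observed statistic and an approximate p-value; a small p-value suggests that the two datasets do not share a common distribution. The pairwise distance computation is quadratic in the sample size, so this sketch is only suitable for small datasets.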