Methods for quantifying the similarity of datasets are relevant wherever two or more datasets, or their underlying distributions, need to be compared, in applications ranging from two- and k-sample testing to machine learning and synthetic data generation. Many such methods are available in the literature, but because neutral comparison studies are lacking, it is unclear which method to choose in which situation. Here, 36 methods applicable to continuous data are compared across various scenarios, including two or more datasets drawn from different distributions. Several types of deviation between datasets are considered, including shift and scale alternatives and differences in higher moments. An overall ranking is established based on the methods' ability to differentiate between datasets from different distributions, combined with computational aspects. From this ranking, concrete decision rules are derived for selecting the best method given the characteristics of the datasets. Moreover, for the two-sample case, combinations of four to six methods are proposed such that in 90% to 95% of the considered scenarios at least one of these methods is almost as good as the best method. For the multi-sample case, combinations of two to three methods are proposed analogously.