In many use cases, combining information from different datasets can improve a machine learning model's performance, especially when at least one of the datasets has few samples. A challenge in such cases is that the datasets do not have identical features, even though they share some features in common. To tackle this challenge, we propose a novel framework called Combine datasets based on Imputation (ComImp). We also propose PCA-ComImp, a variant of ComImp that uses Principal Component Analysis (PCA) to reduce dimensionality before combining datasets. This is useful when the datasets have a large number of features that are not shared between them. Furthermore, our framework can be used for data preprocessing, since it imputes missing data, i.e., fills in the missing entries, while combining different datasets. To illustrate the power of the proposed methods and their potential uses, we conduct experiments across tasks (regression and classification) and data types (tabular and time series data), including settings where the datasets to be combined have missing data. We also investigate how the devised methods can be used with transfer learning to further improve model training. Our results indicate that the proposed methods are somewhat similar to transfer learning in that the merge can significantly improve the accuracy of a prediction model on smaller datasets. In addition, the methods can boost performance by a significant margin when combining small datasets and can provide extra improvement when used with transfer learning.
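The core idea of combining datasets through imputation can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the imputer or the exact stacking procedure, so the function name `combine_by_imputation`, the toy feature names, and the use of column-wise mean imputation are all assumptions made for illustration. The sketch stacks the rows of all datasets on the union of their features, treats features a dataset lacks as missing, and then fills every missing entry.

```python
from statistics import mean


def combine_by_imputation(datasets):
    """Stack datasets on the union of their features, then impute the gaps.

    `datasets` is a list of datasets; each dataset is a list of rows, and
    each row is a dict mapping feature name -> numeric value (or None).
    Hypothetical helper for illustration only.
    """
    # Union of feature names, kept in first-seen order.
    features = []
    for rows in datasets:
        for row in rows:
            for name in row:
                if name not in features:
                    features.append(name)

    # Stack all rows. A feature that a dataset does not have becomes None,
    # exactly like an entry that was already missing in the original data.
    stacked = [[row.get(name) for name in features]
               for rows in datasets
               for row in rows]

    # Column-wise mean imputation. ASSUMPTION: this is a simple stand-in;
    # any imputation method could be substituted at this step.
    for j, name in enumerate(features):
        observed = [r[j] for r in stacked if r[j] is not None]
        if not observed:
            raise ValueError(f"feature {name!r} has no observed values")
        fill = mean(observed)
        for r in stacked:
            if r[j] is None:
                r[j] = fill

    return features, stacked


# Two toy datasets that share "age" but differ in their other feature.
d1 = [{"age": 30, "bmi": 22.0}, {"age": 50, "bmi": 28.0}]
d2 = [{"age": 40, "glucose": 5.0}, {"age": 60, "glucose": 7.0}]

features, combined = combine_by_imputation([d1, d2])
print(features)   # ['age', 'bmi', 'glucose']
for row in combined:
    print(row)
```

In this toy run, the rows from `d2` receive the mean `bmi` observed in `d1` (25.0), and the rows from `d1` receive the mean `glucose` observed in `d2` (6.0). Because entries that were already missing are handled the same way as features a dataset lacks, the same pass also imputes missing data within each dataset, which matches the preprocessing use described above.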