Combining datasets to increase the number of samples and improve model fitting

In many use cases, combining information from different datasets can improve a machine learning model's performance, especially when at least one of the datasets has few samples. A potential challenge in such cases is that the datasets do not have identical features, even though some features are shared among them. To tackle this challenge, we propose a novel framework called Combine datasets based on Imputation (ComImp). In addition, we propose a variant of ComImp that uses Principal Component Analysis (PCA), called PCA-ComImp, to reduce dimension before combining datasets. This is useful when the datasets have a large number of features that are not shared between them. Furthermore, our framework can also be used for data preprocessing by imputing missing data, i.e., filling in the missing entries while combining different datasets. To illustrate the power of the proposed methods and their potential uses, we conduct experiments on various tasks (regression and classification) and on different data types (tabular data and time series data), including settings where the datasets to be combined have missing data. We also investigate how the devised methods can be used with transfer learning to further improve model training. Our results indicate that the proposed methods are somewhat similar to transfer learning, in that merging can significantly improve the accuracy of a prediction model on smaller datasets. In addition, the methods can boost performance by a significant margin when combining small datasets, and can provide extra improvement when used together with transfer learning.
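The abstract does not spell out the procedure, so the following is only a minimal sketch of one plausible reading of combining by imputation: stack the datasets on the union of their features, leave the non-shared entries empty, and then impute them. The function name `combine_by_imputation`, the toy columns, and the use of column-mean imputation are all assumptions made for illustration; the paper may use a different imputer, and the PCA step of PCA-ComImp is not shown.

```python
import pandas as pd


def combine_by_imputation(datasets):
    """Stack datasets on the union of their columns, then fill the gaps.

    Hypothetical helper: column-mean imputation is a simple stand-in,
    not necessarily the imputer used by the authors.
    """
    # concat aligns rows by column name; a column that a dataset lacks
    # is filled with NaN for that dataset's rows.
    stacked = pd.concat(datasets, axis=0, ignore_index=True, sort=False)
    # Replace each NaN with the mean of the observed values in its column.
    return stacked.fillna(stacked.mean(numeric_only=True))


# Toy data (made up): "age" is shared, "height" and "weight" are not.
small = pd.DataFrame({"age": [30.0, 40.0], "height": [170.0, 180.0]})
large = pd.DataFrame({"age": [20.0, 50.0, 60.0], "weight": [60.0, 70.0, 80.0]})

combined = combine_by_imputation([small, large])
print(combined)
```

The combined table has one row per original sample and one column per feature seen in any dataset, so a single model can be trained on all samples. In practice a stronger imputer would likely replace the column mean, since mean imputation ignores relationships between the shared and non-shared features.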

