In this paper, we propose a new method for augmenting numeric and mixed datasets. The method generates additional data points by combining cross-validation resampling with latent variable modeling. It is particularly efficient for datasets with moderate to high degrees of collinearity, as it directly exploits this property for generation. The method is simple and fast, and it has very few parameters, which, as shown in the paper, do not require specific tuning. It has been tested on several real datasets; here, we report detailed results for two cases: prediction of protein in minced meat based on near-infrared spectra (fully numeric data with a high degree of collinearity) and discrimination of patients referred for coronary angiography (mixed data, with both numeric and categorical variables, and moderate collinearity). In both cases, artificial neural networks were employed to develop the regression and discrimination models. The results show a clear improvement in model performance: for the prediction of meat protein, fitting the model to the augmented data reduced the root mean squared error computed for the independent test set by a factor of 1.5 to 3.
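To make the general idea concrete, the following is a minimal sketch of how cross-validation resampling and a latent variable model might be combined to produce artificial rows for collinear numeric data. It is an illustration under stated assumptions, not the algorithm proposed in the paper: the function name `augment_cv_pca`, the use of PCA as the latent variable model, and the parameters `n_components` and `n_segments` are all hypothetical choices made for this example. Each held-out segment is projected onto a local PCA model fitted without it, and the resulting scores are then mapped back through the global loadings.

```python
import numpy as np

def augment_cv_pca(X, n_components=2, n_segments=4, seed=0):
    """Hypothetical sketch: returns one artificial row per original row."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    mean = X.mean(axis=0)
    # Global latent variable model: PCA loadings from the SVD of centered data.
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    P = Vt[:n_components].T
    # Random cross-validation segments (an assumption; other splits are possible).
    segments = np.array_split(rng.permutation(n), n_segments)
    X_new = np.empty_like(X, dtype=float)
    for held_out in segments:
        kept = np.setdiff1d(np.arange(n), held_out)
        local_mean = X[kept].mean(axis=0)
        # Local model fitted without the held-out segment.
        _, _, Vt_k = np.linalg.svd(X[kept] - local_mean, full_matrices=False)
        P_k = Vt_k[:n_components].T
        # Align the sign of each local loading with its global counterpart.
        signs = np.sign(np.sum(P * P_k, axis=0))
        signs[signs == 0] = 1.0
        P_k = P_k * signs
        # Local scores and residuals of the held-out rows.
        Xc = X[held_out] - local_mean
        T_k = Xc @ P_k
        residual = Xc - T_k @ P_k.T
        # Rebuild the rows with global loadings, keeping the local residuals.
        X_new[held_out] = mean + T_k @ P.T + residual
    return X_new

# Synthetic collinear data: 6 variables driven by 2 latent factors plus noise.
rng = np.random.default_rng(1)
scores = rng.normal(size=(40, 2))
loadings = rng.normal(size=(2, 6))
X = scores @ loadings + 0.05 * rng.normal(size=(40, 6))

X_aug = np.vstack([X, augment_cv_pca(X)])
print(X_aug.shape)  # (80, 6)
```

Because the artificial rows differ from the originals only through the sampling variation between local and global models, they stay close to the covariance structure of the data, which is why an approach of this kind is expected to suit collinear datasets. Repeating the call with different `seed` values would give further sets of artificial rows.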