Data augmentation (DA) is a cornerstone of many modern machine learning training pipelines, yet the mechanisms by which it works are not clearly understood. Much of the research on DA has focused on improving existing techniques, examining its regularization effects in the context of neural network overfitting, or investigating its impact on features. Here, we undertake a holistic examination of the effect of DA on three classifiers commonly used in supervised classification of imbalanced data: convolutional neural networks, support vector machines, and logistic regression models. We support our examination with tests on three image and five tabular datasets. Our research indicates that DA, when applied to imbalanced data, produces substantial changes in model weights, support vectors, and feature selection, even though it may yield only relatively modest changes to global metrics such as balanced accuracy or F1 measure. We hypothesize that DA works by increasing variation in the data, so that machine learning models can associate changes in the data with labels. By diversifying the range of feature amplitudes that a model must recognize to predict a label, DA improves a model's capacity to generalize when learning with imbalanced data.
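To make the idea of "diversifying the range of feature amplitudes" for a minority class concrete, the following is a minimal sketch, not the method evaluated in the paper. It assumes a binary tabular dataset and uses Gaussian jitter as a stand-in augmentation; the function name `augment_minority` and the parameter `noise_scale` are hypothetical and chosen only for illustration.

```python
import numpy as np


def augment_minority(X, y, noise_scale=0.1, seed=0):
    """Balance a binary dataset by appending jittered copies of minority samples.

    Assumption: Gaussian jitter is used purely as an illustrative augmentation.
    """
    rng = np.random.default_rng(seed)
    labels, counts = np.unique(y, return_counts=True)
    minority = labels[np.argmin(counts)]
    deficit = counts.max() - counts.min()
    if deficit == 0:
        # Already balanced: nothing to add.
        return X.copy(), y.copy()

    X_min = X[y == minority]
    # Draw minority samples with replacement to serve as augmentation seeds.
    idx = rng.integers(0, len(X_min), size=deficit)
    # Scale the noise per feature by the minority-class standard deviation.
    scale = noise_scale * X_min.std(axis=0)
    X_new = X_min[idx] + rng.normal(0.0, 1.0, size=(deficit, X.shape[1])) * scale
    y_new = np.full(deficit, minority, dtype=y.dtype)
    # Original samples come first, synthetic minority samples are appended.
    return np.vstack([X, X_new]), np.concatenate([y, y_new])


# Toy imbalanced data: 90 majority samples and 10 minority samples.
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0, 1, (90, 2)), rng.normal(2, 1, (10, 2))])
y = np.array([0] * 90 + [1] * 10)

X_aug, y_aug = augment_minority(X, y)
print(np.bincount(y_aug))  # [90 90]
```

Fitting the same classifier on `(X, y)` and on `(X_aug, y_aug)` and comparing the learned weights or support vectors, alongside balanced accuracy, is one way to observe the kind of model-level change the abstract describes.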