Data augmentation is a powerful technique for improving performance in applications such as image and text classification. Yet there is little rigorous understanding of why and how various augmentations work. In this work, we consider a family of linear transformations and study their effects on the ridge estimator in an over-parametrized linear regression setting. First, we show that transformations that preserve the labels of the data can improve estimation by enlarging the span of the training data. Second, we show that transformations that mix data can improve estimation by inducing a regularization effect. Finally, we validate our theoretical insights on MNIST. Based on these insights, we propose an augmentation scheme that searches over the space of transformations according to how uncertain the model is about the transformed data. We validate the proposed scheme on image and text datasets. For example, our method outperforms random sampling methods by 1.24% on CIFAR-100 using Wide-ResNet-28-10. Furthermore, we achieve accuracy comparable to the state-of-the-art Adversarial AutoAugment on the CIFAR-10, CIFAR-100, SVHN, and ImageNet datasets.
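The first claim, that a label-preserving transformation can help the ridge estimator by enlarging the span of the training data, can be illustrated with a small numerical sketch. This is not the paper's experiment. It assumes noiseless labels `y = X @ beta`, the standard ridge form `(X^T X + lam * I)^{-1} X^T y`, and a hypothetical transformation `F = I + (I - beta beta^T) A` built from the true (unit-norm) `beta`, which is only possible in a simulation. The names `ridge`, `A`, `P_perp`, and all sizes are illustrative choices.

```python
import numpy as np


def ridge(X, y, lam):
    """Standard ridge estimator: (X^T X + lam * I)^{-1} X^T y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)


rng = np.random.default_rng(0)
n, p, lam = 20, 50, 1e-6  # over-parametrized: fewer samples than features

# Ground-truth unit-norm parameter and noiseless labels (assumption).
beta = rng.normal(size=p)
beta /= np.linalg.norm(beta)
X = rng.normal(size=(n, p))
y = X @ beta

# Hypothetical label-preserving linear map: F x = x + P_perp A x,
# where P_perp projects onto the orthogonal complement of beta,
# so (F x) . beta == x . beta for every x.
A = rng.normal(size=(p, p)) / np.sqrt(p)
P_perp = np.eye(p) - np.outer(beta, beta)
F = np.eye(p) + P_perp @ A

# Augment: add the transformed copies with their original labels.
X_aug = np.vstack([X, X @ F.T])
y_aug = np.concatenate([y, y])

err_base = np.linalg.norm(ridge(X, y, lam) - beta)
err_aug = np.linalg.norm(ridge(X_aug, y_aug, lam) - beta)
rank_base = np.linalg.matrix_rank(X)
rank_aug = np.linalg.matrix_rank(X_aug)

print(f"rank of training data: {rank_base} -> {rank_aug}")
print(f"estimation error:      {err_base:.3f} -> {err_aug:.3f}")
```

With a very small `lam` and noiseless labels, the ridge estimate is close to the projection of `beta` onto the row span of the data, so adding correctly labeled rows outside the original span cannot increase the estimation error and typically reduces it. With noisy labels or a larger `lam`, the comparison is less clear-cut and would need the kind of analysis the paper carries out.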