A broad body of research has devised data augmentation approaches that improve both the accuracy and the generalization performance of neural networks. However, augmented data can end up far from the clean training data, and the appropriate label for it becomes less clear. Despite this, most existing work simply reuses one-hot labels for augmented data. In this paper, we show that reusing one-hot labels for highly distorted data risks adding noise and degrading accuracy and calibration. To mitigate this, we propose AutoLabel, a generic method that automatically learns the confidence in the labels of augmented data, based on the transformation distance between the clean distribution and the augmented distribution. AutoLabel is built on label smoothing and is guided by calibration performance on a hold-out validation set. We successfully apply AutoLabel to three different data augmentation techniques: the state-of-the-art RandAug, AugMix, and adversarial training. Experiments on CIFAR-10, CIFAR-100 and ImageNet show that AutoLabel significantly improves the calibration and accuracy of models trained with existing data augmentation techniques, especially under distributional shift.
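To make the label-smoothing idea concrete, the following minimal sketch builds a soft label whose confidence in the true class depends on how far an augmented example is from the clean data. It is an illustration under stated assumptions, not the AutoLabel procedure itself: the function `smoothed_label`, the discrete distance levels, and the values in `confidence_by_distance` are hypothetical and hand-picked here, whereas the abstract describes learning these confidences from calibration on a hold-out validation set.

```python
import numpy as np


def smoothed_label(true_class: int, num_classes: int, confidence: float) -> np.ndarray:
    """Return a soft label with `confidence` on the true class.

    The remaining probability mass is spread uniformly over the other classes,
    as in standard label smoothing. confidence=1.0 recovers the one-hot label.
    """
    if not 1.0 / num_classes <= confidence <= 1.0:
        raise ValueError("confidence must lie in [1/num_classes, 1]")
    label = np.full(num_classes, (1.0 - confidence) / (num_classes - 1))
    label[true_class] = confidence
    return label


# Hypothetical confidences per transformation distance level (assumption):
# level 0 = clean data, higher levels = stronger distortion.
# These values are made up for illustration; they are not learned.
confidence_by_distance = {0: 1.0, 1: 0.9, 2: 0.7}

clean = smoothed_label(true_class=3, num_classes=10,
                       confidence=confidence_by_distance[0])
distorted = smoothed_label(true_class=3, num_classes=10,
                           confidence=confidence_by_distance[2])
```

In this sketch, clean examples keep their one-hot label, while heavily distorted examples receive a softer target, which reflects the abstract's point that one-hot labels may be too confident for highly distorted data.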