Label noise is common in large real-world datasets, and it harms the training of deep neural networks. Although several works have proposed training strategies to address this problem, few studies evaluate the impact of data augmentation as a design choice when training deep neural networks. In this work, we analyse model robustness under different data augmentations and how much each improves training in the presence of noisy labels. We evaluate state-of-the-art and classical data augmentation strategies at different levels of synthetic noise on the MNIST, CIFAR-10, and CIFAR-100 datasets, and on the real-world dataset Clothing1M. We compare the methods using accuracy as the metric. Results show that an appropriate choice of data augmentation can drastically improve model robustness to label noise, with a relative improvement of up to 177.84% in best test accuracy over the baseline with no augmentation, and an absolute gain of up to 6% when combined with the state-of-the-art DivideMix training strategy.
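To make the two quantities in the results concrete, the sketch below shows one common way to inject synthetic label noise and the difference between a relative and an absolute accuracy gain. It is an illustration under stated assumptions, not the protocol used in this work: the function names `add_symmetric_noise`, `relative_improvement`, and `absolute_improvement` are hypothetical, the noise model assumed here is symmetric noise in which a corrupted label is always moved to a different class, and the accuracy values in the usage note are made up.

```python
import random


def add_symmetric_noise(labels, noise_rate, num_classes, seed=0):
    """Return a copy of `labels` where roughly `noise_rate` of the entries
    are replaced by a different class chosen uniformly at random.

    Assumption: symmetric noise that never keeps the original label.
    Other definitions allow a "flip" to land on the true class.
    """
    rng = random.Random(seed)
    noisy = []
    for y in labels:
        if rng.random() < noise_rate:
            # An offset in 1..num_classes-1 guarantees a different class.
            offset = rng.randrange(1, num_classes)
            noisy.append((y + offset) % num_classes)
        else:
            noisy.append(y)
    return noisy


def relative_improvement(baseline_acc, new_acc):
    """Gain expressed as a percentage of the baseline accuracy."""
    return 100.0 * (new_acc - baseline_acc) / baseline_acc


def absolute_improvement(baseline_acc, new_acc):
    """Gain expressed as a plain difference between the two accuracies."""
    return new_acc - baseline_acc
```

With hypothetical accuracies of 20.0% for a baseline and 50.0% with augmentation, `relative_improvement(20.0, 50.0)` is 150.0 while `absolute_improvement(20.0, 50.0)` is 30.0, which is why a large relative figure and a single-digit absolute figure can both describe meaningful gains.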