Patch-level data augmentation techniques such as Cutout and CutMix have demonstrated significant efficacy in enhancing the performance of vision tasks. However, a comprehensive theoretical understanding of these methods remains elusive. In this paper, we study two-layer neural networks trained using three distinct methods: vanilla training without augmentation, Cutout training, and CutMix training. Our analysis is based on a feature-noise data model, which consists of several label-dependent features of varying rarity and label-independent noises of differing strengths. Our theorems demonstrate that Cutout training can learn low-frequency features that vanilla training cannot, while CutMix training can learn even rarer features that Cutout cannot capture. From this, we establish that CutMix yields the highest test accuracy among the three. Our novel analysis reveals that CutMix training makes the network learn all features and noise vectors "evenly" regardless of their rarity and strength, which provides an interesting insight into understanding patch-level augmentation.