Mixup, a simple data augmentation method that randomly mixes two data points via linear interpolation, has been widely applied in deep learning to improve generalization. However, the theoretical underpinnings of its efficacy are not yet fully understood. In this paper, we seek a fundamental understanding of the benefits of Mixup. We first show that Mixup using different linear interpolation parameters for features and labels can still achieve performance similar to that of standard Mixup. This indicates that the intuitive linearity explanation in Zhang et al. (2018) may not fully explain the success of Mixup. We then perform a theoretical study of Mixup from the feature learning perspective. We consider a feature-noise data model and show that Mixup training can effectively learn the rare features (appearing in a small fraction of data) from their mixture with the common features (appearing in a large fraction of data). In contrast, standard training can learn only the common features and fails to learn the rare features, and thus generalizes poorly. Moreover, our theoretical analysis shows that the benefits of Mixup for feature learning are mostly gained in the early training phase, based on which we propose applying early stopping to Mixup. Experimental results verify our theoretical findings and demonstrate the effectiveness of early-stopped Mixup training.
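As a rough illustration of the interpolation described above, the following minimal sketch mixes two (feature, one-hot label) pairs. It is not the paper's implementation: the function names `mixup_pair` and `sample_lambda` and the parameters `lam_x`, `lam_y`, and `alpha` are hypothetical, and drawing the mixing weight from a Beta(alpha, alpha) distribution is assumed here, following common Mixup practice rather than anything stated in this abstract. Passing a `lam_y` different from `lam_x` gives one possible reading of the variant that uses different interpolation parameters for features and labels.

```python
import random


def mixup_pair(x1, y1, x2, y2, lam_x, lam_y=None):
    """Linearly interpolate two (feature, one-hot label) pairs.

    lam_x weights the features and lam_y weights the labels.
    Standard Mixup uses lam_y == lam_x (the default here); a different
    lam_y decouples the two (hypothetical parameterization).
    """
    if lam_y is None:
        lam_y = lam_x
    x = [lam_x * a + (1.0 - lam_x) * b for a, b in zip(x1, x2)]
    y = [lam_y * a + (1.0 - lam_y) * b for a, b in zip(y1, y2)]
    return x, y


def sample_lambda(alpha=1.0, rng=random):
    """Draw a mixing weight in [0, 1] from Beta(alpha, alpha) (assumed choice)."""
    return rng.betavariate(alpha, alpha)


# Standard Mixup: the same weight for features and labels.
x_mix, y_mix = mixup_pair([1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0], lam_x=0.75)

# Decoupled variant: independent weights for features and labels.
x_dec, y_dec = mixup_pair([1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0],
                          lam_x=0.75, lam_y=0.5)
```

In a training loop, a fresh weight from `sample_lambda` would typically be drawn per batch or per pair; the early-stopped variant proposed in the abstract would simply end Mixup training before convergence, with the stopping point left as a tuning choice.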