Data augmentation is a dominant method for reducing model overfitting and improving generalization. Most existing data augmentation methods settle for a compromise, \textit{i.e.}, they increase the augmentation amplitude cautiously to avoid degrading some samples so much that model performance suffers. We investigate the relationship between data augmentation and model performance and find that the performance drop under heavy augmentation comes from the presence of out-of-distribution (OOD) data. Nonetheless, because the same data transformation affects different training samples differently, even heavy augmentation leaves a portion of the data in distribution, and this portion is beneficial to model training. Based on this observation, we propose a novel data augmentation method, named \textbf{DualAug}, that keeps the augmented data in distribution as much as possible at a reasonable time and computational cost. We design a data mixing strategy to fuse augmented data from both the basic- and the heavy-augmentation branches. Extensive experiments on supervised image classification benchmarks show that DualAug improves various automated data augmentation methods. Moreover, experiments on semi-supervised learning and contrastive self-supervised learning demonstrate that DualAug can also improve related methods. Code is available at \href{https://github.com/shuguang99/DualAug}{https://github.com/shuguang99/DualAug}.
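The mixing of the basic- and heavy-augmentation branches can be illustrated with a minimal sketch. The abstract does not specify how OOD samples are detected, so the sketch assumes, purely for illustration, that each heavily augmented sample already carries a scalar OOD score and that a fixed threshold separates in-distribution from OOD samples. The function name \texttt{mix\_branches}, the scores, and the threshold are hypothetical and are not taken from the DualAug code.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def mix_branches(
    basic: Sequence[T],
    heavy: Sequence[T],
    ood_scores: Sequence[float],
    threshold: float,
) -> List[T]:
    """Fuse two augmentation branches sample by sample.

    Assumption (not from the paper): ood_scores[i] is a precomputed score
    for heavy[i], where a larger value means "more likely OOD".
    """
    if not (len(basic) == len(heavy) == len(ood_scores)):
        raise ValueError("all inputs must have the same length")

    mixed: List[T] = []
    for basic_view, heavy_view, score in zip(basic, heavy, ood_scores):
        # Keep the heavy view while it is judged in-distribution;
        # otherwise fall back to the basic view of the same sample.
        mixed.append(heavy_view if score <= threshold else basic_view)
    return mixed


# Toy usage with string placeholders standing in for image tensors.
batch = mix_branches(
    basic=["b0", "b1", "b2"],
    heavy=["h0", "h1", "h2"],
    ood_scores=[0.1, 0.9, 0.4],
    threshold=0.5,
)
print(batch)
```

In this toy batch only the second sample exceeds the assumed threshold, so it falls back to its basic-augmented view while the other two keep their heavy-augmented views. In practice the score and the threshold would have to come from the model or the data, which this sketch leaves open.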