Data augmentation, a cornerstone technique in deep learning, is crucial for enhancing model performance, especially when labeled data is scarce. While traditional techniques are effective, their reliance on hand-crafted methods limits their applicability across diverse data types and tasks. Although modern learnable augmentation methods offer increased adaptability, they are computationally expensive and difficult to incorporate into prevalent augmentation workflows. In this work, we present a novel, efficient method for data augmentation that bridges the gap between existing augmentation strategies and emerging datasets and learning tasks. We introduce SAFLEX (Self-Adaptive Augmentation via Feature Label EXtrapolation), which learns the sample weights and soft labels of augmented samples produced by any given upstream augmentation pipeline, using a specifically designed, efficient bilevel optimization algorithm. Remarkably, SAFLEX effectively reduces the noise and label errors of the upstream augmentation pipeline at marginal computational cost. As a versatile module, SAFLEX excels across diverse datasets, including natural and medical images and tabular data, demonstrating strong performance in few-shot learning and out-of-distribution generalization. It integrates seamlessly with common augmentation strategies such as RandAug and CutMix, with augmentations from large pre-trained generative models such as Stable Diffusion, and with frameworks such as CLIP fine-tuning. Our findings highlight the potential of adapting existing augmentation pipelines to new data types and tasks, signaling a move towards more adaptable and resilient training frameworks.
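To make the bilevel idea concrete, the following is a minimal, self-contained sketch of learning per-sample weights for augmented data: the inner problem fits a toy 1-D logistic model on weighted augmented samples, and the outer problem adjusts each sample's weight to reduce loss on a small clean validation set. This is an illustrative toy using finite-difference hypergradients, not the paper's actual SAFLEX algorithm; all names, learning rates, and data here are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def inner_step(w, samples, weights, soft_labels, lr=0.5):
    """One weighted gradient step of a 1-D logistic model (inner problem)."""
    grad = 0.0
    for (x, _), omega, y_soft in zip(samples, weights, soft_labels):
        p = sigmoid(w * x)
        grad += omega * (p - y_soft) * x
    return w - lr * grad / len(samples)

def val_loss(w, val_set):
    """Cross-entropy on clean validation data (outer objective)."""
    loss = 0.0
    for x, y in val_set:
        p = min(max(sigmoid(w * x), 1e-7), 1 - 1e-7)
        loss += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return loss / len(val_set)

def outer_step(w, samples, weights, soft_labels, val_set, eps=1e-4, lr=0.1):
    """Update each sample weight via a finite-difference hypergradient."""
    base = val_loss(inner_step(w, samples, weights, soft_labels), val_set)
    new_weights = []
    for i in range(len(weights)):
        bumped = list(weights)
        bumped[i] += eps
        g = (val_loss(inner_step(w, samples, bumped, soft_labels), val_set)
             - base) / eps
        new_weights.append(max(0.0, weights[i] - lr * g))
    return new_weights

# Toy augmented set: x > 0 should be class 1; the last sample is mislabeled.
aug = [(2.0, 1.0), (1.5, 1.0), (-2.0, 0.0), (-1.0, 1.0)]
soft = [y for _, y in aug]                       # soft labels start at the given labels
val = [(1.0, 1.0), (-1.0, 0.0), (2.5, 1.0), (-2.5, 0.0)]  # small clean validation set

w, weights = 0.0, [1.0] * len(aug)
for _ in range(20):                              # alternate outer and inner updates
    weights = outer_step(w, aug, weights, soft, val)
    w = inner_step(w, aug, weights, soft)
```

After a few alternating updates, the weight of the mislabeled augmented sample drops below the weights of the correctly labeled ones, mirroring the abstract's claim that the method suppresses noise and label errors introduced upstream. A soft-label update would extrapolate each `soft[i]` analogously, via its own hypergradient.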