Synthetically augmenting training datasets with diffusion models has become an effective strategy for improving the generalization of image classifiers. However, existing approaches typically increase dataset size by 10-30x, incurring substantial computational overhead, and struggle to ensure generation diversity. In this work, we introduce TADA (TArgeted Diffusion Augmentation), a principled framework that selectively augments the examples that are not learned early in training, using faithful synthetic images that preserve semantic features while varying noise. We show that augmenting only this targeted subset consistently outperforms augmenting the entire dataset. Through theoretical analysis on a two-layer CNN, we prove that TADA improves generalization by promoting homogeneity in feature learning speed without amplifying noise. Extensive experiments demonstrate that by augmenting only 30-40% of the training data, TADA improves generalization by up to 2.8% across diverse architectures, including ResNet, ViT, ConvNeXt, and Swin Transformer, on CIFAR-10/100, TinyImageNet, and ImageNet, with optimizers such as SGD and SAM. Notably, TADA combined with SGD outperforms the state-of-the-art optimizer SAM on CIFAR-100 and TinyImageNet. Furthermore, TADA shows promising improvements on object detection benchmarks, demonstrating its applicability beyond image classification. Our code is available at https://github.com/BigML-CS-UCLA/TADA.
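As a concrete illustration of the targeted selection step, below is a minimal sketch assuming one plausible proxy for "not learned early": an example is treated as learned early only if it is classified correctly in every one of the first few epochs. The `learn_epoch` threshold and the `record_correctness` / `not_learned_early` names are illustrative assumptions, not the paper's actual criterion or API.

```python
import torch

@torch.no_grad()
def record_correctness(model, loader, device="cpu"):
    """Return a bool tensor marking whether each example is currently
    classified correctly. The loader must iterate in a fixed order so
    that indices align across epochs."""
    model.eval()
    correct = []
    for x, y in loader:
        preds = model(x.to(device)).argmax(dim=1).cpu()
        correct.append(preds == y)
    return torch.cat(correct)

def not_learned_early(history, learn_epoch=5):
    """history: list of per-epoch bool tensors from record_correctness.
    An example counts as 'learned early' if it is correct in every one
    of the first `learn_epoch` epochs (an assumed proxy)."""
    early = torch.stack(history[:learn_epoch])   # (learn_epoch, N)
    return (~early.all(dim=0)).nonzero(as_tuple=True)[0]

# Usage sketch: append record_correctness(model, eval_loader) to `history`
# after each early epoch, then send only dataset[i] for i in
# not_learned_early(history) to the diffusion model for faithful variants.
```

Under this sketch, only the returned indices (roughly the 30-40% of examples the abstract reports augmenting) would be passed to the diffusion model; the remaining examples are left unaugmented.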