Many fine-grained classification tasks, such as rare animal identification, have limited training data, and classifiers trained on these datasets often fail to generalize to variations in the domain such as changes in weather or location. We therefore explore how natural language descriptions of the domains seen in training data can be used with large vision models trained on diverse pretraining datasets to generate useful variations of the training data. We introduce ALIA (Automated Language-guided Image Augmentation), a method that uses large vision and language models to automatically generate natural language descriptions of a dataset's domains and augment the training data via language-guided image editing. To maintain data integrity, a model trained on the original dataset filters out minimal image edits and edits that corrupt class-relevant information. The resulting dataset is visually consistent with the original training data and offers significantly enhanced diversity. On fine-grained and cluttered datasets for classification and detection, ALIA surpasses traditional data augmentation and text-to-image generated data by up to 15%, often even outperforming equivalent additions of real data. Code is available at https://github.com/lisadunlap/ALIA.
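The filtering step described above can be illustrated with a minimal sketch. This is not the authors' implementation: the `EditedSample` container, the two thresholds (`corrupt_conf`, `min_shift`), and both decision rules are assumptions chosen only to make the idea concrete. The sketch assumes a classifier trained on the original dataset has already produced softmax probabilities for each source image and its edited counterpart.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class EditedSample:
    """One language-guided edit plus classifier outputs (hypothetical container)."""
    name: str
    label: int                        # ground-truth class of the source image
    probs_original: Sequence[float]   # classifier softmax on the source image
    probs_edited: Sequence[float]     # classifier softmax on the edited image


def argmax(values: Sequence[float]) -> int:
    """Index of the largest value."""
    return max(range(len(values)), key=lambda i: values[i])


def keep_edit(sample: EditedSample,
              corrupt_conf: float = 0.8,
              min_shift: float = 0.05) -> bool:
    """Return True if the edit should stay in the augmented dataset.

    Both rules and both thresholds are illustrative assumptions.
    """
    pred = argmax(sample.probs_edited)

    # Assumed rule 1: a confident prediction of the wrong class suggests the
    # edit corrupted class-relevant information.
    if pred != sample.label and sample.probs_edited[pred] >= corrupt_conf:
        return False

    # Assumed rule 2: if the classifier's output barely moves, treat the edit
    # as minimal, since it adds little diversity.
    shift = max(abs(a - b)
                for a, b in zip(sample.probs_original, sample.probs_edited))
    if shift < min_shift:
        return False

    return True


def filter_edits(samples: List[EditedSample], **thresholds) -> List[EditedSample]:
    """Keep only the edits that pass both assumed rules."""
    return [s for s in samples if keep_edit(s, **thresholds)]


# Toy two-class example with made-up probabilities.
samples = [
    EditedSample("minimal", 0, [0.90, 0.10], [0.91, 0.09]),
    EditedSample("corrupted", 0, [0.90, 0.10], [0.05, 0.95]),
    EditedSample("useful", 0, [0.90, 0.10], [0.65, 0.35]),
]
kept = filter_edits(samples)
print([s.name for s in kept])
```

In this toy run only the third edit survives: the first changes the classifier's output too little, and the second flips the prediction to the wrong class with high confidence. In practice the thresholds would need tuning per dataset.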