Acquiring large-scale, high-quality data is resource-intensive and time-consuming. Compared with conventional data augmentation (DA) techniques (e.g., cropping and rotation), the use of diffusion models for data generation has received little attention in classification tasks. Existing generative DA methods either fail to bridge the domain gap between real and synthesized images or inherently lack diversity. To address these issues, this paper proposes DreamDA, a classification-oriented framework that performs data synthesis and label generation with diffusion models. DreamDA treats training images from the original data as seeds and perturbs their reverse diffusion process, generating diverse samples that adhere to the original data distribution. In addition, since the labels of the generated data may not align with the labels of their corresponding seed images, we introduce a self-training paradigm that generates pseudo labels and trains classifiers on the synthesized data. Extensive experiments across four tasks and five datasets demonstrate consistent improvements over strong baselines, indicating that DreamDA synthesizes high-quality, diverse images with accurate labels. Our code will be available at https://github.com/yunxiangfu2001/DreamDA.
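The self-training idea described above (synthesized samples do not simply inherit their seed's label, but receive pseudo labels from a classifier) can be illustrated with a minimal Python sketch. This is not the DreamDA implementation. Everything in it is an assumption made for illustration: the 2-D toy points, the nearest-centroid classifier, the Gaussian jitter in `perturb_seeds` that merely stands in for perturbing the reverse diffusion process, the confidence threshold of 0.7, and all function names.

```python
import math
import random


def perturb_seeds(seeds, per_seed, sigma, seed=0):
    """Hypothetical stand-in for generation: jitter each seed point with
    Gaussian noise. A real pipeline would perturb a diffusion model's
    reverse process instead; this only mimics 'variations of a seed'."""
    rng = random.Random(seed)
    return [
        (x + rng.gauss(0.0, sigma), y + rng.gauss(0.0, sigma))
        for (x, y) in seeds
        for _ in range(per_seed)
    ]


def fit_centroids(points, labels):
    """Fit a nearest-centroid classifier: one mean vector per class."""
    sums, counts = {}, {}
    for (x, y), c in zip(points, labels):
        sx, sy = sums.get(c, (0.0, 0.0))
        sums[c] = (sx + x, sy + y)
        counts[c] = counts.get(c, 0) + 1
    return {c: (sums[c][0] / counts[c], sums[c][1] / counts[c]) for c in sums}


def predict(model, point):
    """Return (label, confidence); confidence is a softmax over
    negative Euclidean distances to the class centroids."""
    weights = {c: math.exp(-math.dist(point, mu)) for c, mu in model.items()}
    label = max(weights, key=weights.get)
    return label, weights[label] / sum(weights.values())


def self_train(real_x, real_y, synth_x, threshold=0.7):
    """One round of self-training: a teacher fitted on real data assigns
    pseudo labels to synthetic points, low-confidence points are dropped,
    and a student is fitted on real plus confidently labeled synthetic data."""
    teacher = fit_centroids(real_x, real_y)
    kept_x, kept_y = [], []
    for p in synth_x:
        label, conf = predict(teacher, p)
        if conf >= threshold:
            kept_x.append(p)
            kept_y.append(label)
    student = fit_centroids(real_x + kept_x, real_y + kept_y)
    return student, kept_x, kept_y


if __name__ == "__main__":
    real_x = [(0.0, 0.0), (0.2, 0.1), (3.0, 3.0), (3.1, 2.9)]
    real_y = [0, 0, 1, 1]
    synth_x = perturb_seeds(real_x, per_seed=5, sigma=0.5)
    student, kept_x, kept_y = self_train(real_x, real_y, synth_x)
    print(f"kept {len(kept_x)} of {len(synth_x)} synthetic points")
```

The design choice worth noting is that a synthetic point's label comes from the teacher's prediction rather than from the seed it was derived from, so a sample that drifts toward another class during generation is either relabeled or discarded instead of being trained on with a stale label.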