The power of deep neural networks (DNNs) relies heavily on the quantity and quality of training data. However, collecting and annotating data on a large scale is often expensive and time-consuming. To address this issue, we explore a new task, termed dataset expansion, which aims to expand a ready-to-use small dataset by automatically creating new labeled samples. To this end, we present a Guided Imagination Framework (GIF) that leverages cutting-edge generative models such as DALL-E2 and Stable Diffusion (SD) to "imagine" and create informative new data from the input seed data. Specifically, GIF conducts data imagination by optimizing the latent features of the seed data in the semantically meaningful space of the prior model, resulting in photo-realistic images with new content. To guide the imagination towards creating informative samples for model training, we introduce two key criteria: class-maintained information boosting and sample diversity promotion. These criteria are verified to be essential for effective dataset expansion: GIF-SD obtains 13.5% higher model accuracy on natural image datasets than unguided expansion with SD. With these essential criteria, GIF successfully expands small datasets in various scenarios, boosting model accuracy by 36.9% on average over six natural image datasets and by 13.5% on average over three medical datasets. The source code is available at https://github.com/Vanint/DatasetExpansion.
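The two guidance criteria can be illustrated with a toy sketch. This is not the GIF implementation. It assumes a hypothetical linear classifier `W` in place of the prior model, uses random search plus selection in place of the latent optimization described above, and the scoring terms (entropy increase for information boosting, distance from the candidate mean for diversity) are illustrative choices rather than the paper's objectives. The names `guided_candidates`, `scale`, and `div_weight` are invented for this example.

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over the last axis.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def entropy(probs, eps=1e-12):
    # Shannon entropy of each probability vector.
    return -(probs * np.log(probs + eps)).sum(axis=-1)

def guided_candidates(seed, W, n_candidates=64, n_keep=4,
                      scale=0.5, div_weight=1.0, rng=None):
    """Perturb a seed latent and keep the best-scoring candidates.

    Assumption: `W` is a toy linear classifier (latent_dim x n_classes)
    standing in for the prior model's zero-shot predictions.
    """
    rng = np.random.default_rng() if rng is None else rng
    seed_probs = softmax(seed @ W)
    seed_class = int(seed_probs.argmax())

    # Sample perturbed latents around the seed (random search, not
    # gradient-based optimization).
    noise = rng.normal(0.0, scale, size=(n_candidates, seed.shape[0]))
    cands = seed + noise
    probs = softmax(cands @ W)

    # Class-maintained: discard candidates whose predicted class changed.
    keeps_class = probs.argmax(axis=1) == seed_class
    # Information boosting (illustrative): prefer higher prediction entropy.
    info = entropy(probs) - entropy(seed_probs)
    # Diversity promotion (illustrative): prefer candidates far from the mean.
    div = np.linalg.norm(cands - cands.mean(axis=0), axis=1)

    score = np.where(keeps_class, info + div_weight * div, -np.inf)
    order = np.argsort(score)[::-1][:n_keep]
    order = order[np.isfinite(score[order])]  # drop class-changing ones
    return cands[order], seed_class

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    seed = np.array([3.0, 0.0, 0.0])
    kept, cls = guided_candidates(seed, np.eye(3), rng=rng)
    print(kept.shape, cls)
```

In an actual pipeline the kept latents would be decoded into images by the generative model. Here they remain plain vectors so the selection logic can be read in isolation.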