While text-to-image diffusion models have been shown to achieve state-of-the-art results in image synthesis, they have yet to prove their effectiveness in downstream applications. Previous work has proposed to generate data for image classifier training given limited real data access. However, these methods struggle to generate in-distribution images or depict fine-grained features, thereby hindering the generalization of classification models trained on synthetic datasets. We propose DataDream, a framework for synthesizing classification datasets that more faithfully represent the real data distribution when guided by few-shot examples of the target classes. DataDream fine-tunes LoRA weights for the image generation model on the few real images before generating the training data using the adapted model. We then fine-tune LoRA weights for CLIP using the synthetic data to improve downstream image classification over previous approaches on a large variety of datasets. We demonstrate the efficacy of DataDream through extensive experiments, surpassing state-of-the-art classification accuracy with few-shot data on 7 out of 10 datasets, while being competitive on the other 3. Additionally, we provide insights into the impact of various factors, such as the number of real few-shot and generated images, as well as the fine-tuning compute, on model performance. The code is available at https://github.com/ExplainableML/DataDream.