With the rapid development of Artificial Intelligence Generated Content (AIGC), training or fine-tuning large models on synthetic data has become common practice in many learning tasks, motivated by data scarcity and privacy-leakage concerns. Although generative models promise unlimited data, real images convey massive and diverse information, so it is challenging for text-to-image generative models to synthesize informative training data from hand-crafted prompts; this usually leads to inferior generalization when training downstream models. In this paper, we theoretically analyze the relationship between the training effect of synthetic data and the synthetic data distribution induced by prompts. Based on this analysis, we propose a simple yet effective method that prompts text-to-image generative models to synthesize more informative and diverse training data. Specifically, we caption each real image with an advanced captioning model to obtain informative and faithful prompts that extract class-relevant information and resolve the polysemy of class names. The image captions and class names are then concatenated to prompt generative models for training-image synthesis. Extensive experiments on ImageNette, ImageNet-100, and ImageNet-1K verify that our method significantly improves the performance of models trained on synthetic training data, with classification accuracy improving by 10% on average.
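The prompt-construction step described in the abstract (caption each real image, then concatenate the caption with the class name) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration: the function names `build_prompt` and `build_prompts`, the class-name-first order, the `", "` separator, and the toy captions. The abstract states only that captions and class names are concatenated; it does not give the exact format, and the captioning model is replaced by a hypothetical callable.

```python
from typing import Callable, Iterable, List, Tuple


def build_prompt(class_name: str, caption: str, sep: str = ", ") -> str:
    """Concatenate a class name and an image caption into one prompt.

    The order (class name first) and the separator are assumptions;
    the abstract only says the two are concatenated.
    """
    class_name = class_name.strip()
    caption = caption.strip()
    if not caption:
        # Fall back to the bare class name when no caption is available.
        return class_name
    return f"{class_name}{sep}{caption}"


def build_prompts(
    samples: Iterable[Tuple[str, str]],
    caption_fn: Callable[[str], str],
) -> List[str]:
    """Build one prompt per real image.

    samples: (image_path, class_name) pairs from the real training set.
    caption_fn: hypothetical stand-in for a captioning model; it maps an
        image path to a caption string.
    """
    return [build_prompt(name, caption_fn(path)) for path, name in samples]


# Made-up captions standing in for captioning-model output, so the sketch
# runs without any model. "crane" is a polysemous class name: the caption
# indicates which sense each real image shows.
toy_captions = {
    "img_001.jpg": "a white bird standing in shallow water",
    "img_002.jpg": "a yellow machine lifting steel beams at a construction site",
}

prompts = build_prompts(
    [("img_001.jpg", "crane"), ("img_002.jpg", "crane")],
    lambda path: toy_captions[path],
)

for prompt in prompts:
    print(prompt)
```

Each resulting string would then be passed to a text-to-image model in place of a hand-crafted template. Because every real image yields its own caption, the prompts vary with the content of the real data rather than repeating one fixed phrase per class.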