In subject-driven text-to-image generation, recent works have achieved superior performance by training models on synthetic datasets containing numerous image pairs. Trained on such datasets, generative models can produce text-aligned images for a specific subject from an arbitrary test image in a zero-shot manner. They even outperform methods that require additional fine-tuning on test images. However, the cost of creating such datasets is prohibitive for most researchers. To generate a single training pair, current methods fine-tune a pre-trained text-to-image model on the subject image to capture fine-grained details, then use the fine-tuned model to create images of the same subject based on creative text prompts. Consequently, constructing a large-scale dataset with millions of subjects can require hundreds of thousands of GPU hours. To tackle this problem, we propose Toffee, an efficient method for constructing datasets for subject-driven editing and generation. Specifically, our dataset construction does not require any subject-level fine-tuning. After pre-training two generative models, we are able to generate an unlimited number of high-quality samples. We construct the first large-scale dataset for subject-driven image editing and generation, which contains 5 million image pairs, text prompts, and masks. Our dataset is 5 times the size of the previous largest dataset, yet its construction cost is lower by tens of thousands of GPU hours. To test the proposed dataset, we also propose a model capable of both subject-driven image editing and generation. Simply by training on our dataset, the model obtains competitive results, illustrating the effectiveness of the proposed dataset construction framework.