In subject-driven text-to-image generation, recent works have achieved superior performance by training models on synthetic datasets containing numerous image pairs. Trained on such datasets, generative models can produce text-aligned images for a specific subject from an arbitrary test image in a zero-shot manner. They even outperform methods that require additional fine-tuning on test images. However, the cost of creating such datasets is prohibitive for most researchers. To generate a single training pair, current methods fine-tune a pre-trained text-to-image model on the subject image to capture fine-grained details, then use the fine-tuned model to create images of the same subject based on creative text prompts. Consequently, constructing a large-scale dataset with millions of subjects can require hundreds of thousands of GPU hours. To tackle this problem, we propose Toffee, an efficient method for constructing datasets for subject-driven editing and generation. Specifically, our dataset construction does not require any subject-level fine-tuning. After pre-training two generative models, we are able to generate an unlimited number of high-quality samples. We construct the first large-scale dataset for subject-driven image editing and generation, which contains 5 million image pairs, text prompts, and masks. Our dataset is 5 times the size of the previous largest dataset, yet its construction cost is lower by tens of thousands of GPU hours. To test the proposed dataset, we also propose a model capable of both subject-driven image editing and generation. Simply by training on our dataset, the model obtains competitive results, illustrating the effectiveness of the proposed dataset construction framework.