We assemble a dataset of Creative-Commons-licensed (CC) images, which we use to train a set of open diffusion models that are qualitatively competitive with Stable Diffusion 2 (SD2). This task presents two challenges: (1) high-resolution CC images lack the captions necessary to train text-to-image generative models; (2) CC images are relatively scarce. To address these challenges, we use an intuitive transfer learning technique to produce a set of high-quality synthetic captions paired with curated CC images. We then develop a data- and compute-efficient training recipe that requires as little as 3% of the LAION-2B data needed to train existing SD2 models, but obtains comparable quality. These results indicate that the number of available CC images (~70 million) is sufficient for training high-quality models. Our training recipe also implements a variety of optimizations that achieve ~3X training speed-ups, enabling rapid model iteration. We leverage this recipe to train several high-quality text-to-image models, which we dub the CommonCanvas family. Our largest model achieves comparable performance to SD2 on a human evaluation, despite being trained on our CC dataset, which is significantly smaller than LAION, and using synthetic captions for training. We release our models, data, and code at https://github.com/mosaicml/diffusion/blob/main/assets/common-canvas.md