Text-to-image diffusion models have achieved a remarkable leap in capabilities over the last few years, enabling high-quality and diverse synthesis of images from a textual prompt. However, even the most advanced models often struggle to precisely follow all of the directions in their prompts. The vast majority of these models are trained on datasets of (image, caption) pairs in which the images often come from the web and the captions are their HTML alternate text. A notable example is the LAION dataset, used by Stable Diffusion and other models. In this work we observe that these captions are often of low quality, and we argue that this significantly impairs the model's ability to understand nuanced semantics in textual prompts. We show that relabeling the corpus with a specialized automatic captioning model and training a text-to-image model on the recaptioned dataset benefits the model substantially across the board. First, overall image quality improves: e.g., FID of 14.84 vs. the baseline of 17.87, and a 64.3% improvement in faithful image generation according to human evaluation. Second, semantic alignment improves: e.g., semantic object accuracy of 84.34 vs. 78.90, counting alignment errors of 1.32 vs. 1.44, and positional alignment of 62.42 vs. 57.60. We analyze various ways to relabel the corpus and provide evidence that this technique, which we call RECAP, both reduces the train-inference discrepancy and provides the model with more information per example, increasing sample efficiency and allowing the model to better understand the relations between captions and images.
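The relabeling step described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Example` record, the `recaption` function, and `toy_captioner` are hypothetical names introduced here, and the lookup-table captioner merely stands in for a learned captioning model. The sketch shows only the data flow: each image is kept, and its web alt-text is replaced by the captioner's output before text-to-image training.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class Example:
    """One (image, caption) training pair. `image_id` stands in for pixel data."""
    image_id: str
    caption: str


def recaption(
    examples: Iterable[Example],
    captioner: Callable[[str], str],
) -> List[Example]:
    """Return new pairs whose captions come from `captioner` instead of alt-text."""
    return [Example(ex.image_id, captioner(ex.image_id)) for ex in examples]


def toy_captioner(image_id: str) -> str:
    # Hypothetical stand-in: a real system would run a captioning model on the image.
    descriptions = {
        "img_001": "two brown dogs sitting to the left of a red bicycle",
        "img_002": "a white ceramic mug on a wooden table",
    }
    return descriptions.get(image_id, "an image")


# Web-style alt-text is often uninformative about the image content.
web_pairs = [
    Example("img_001", "IMG_2034.jpg"),
    Example("img_002", "buy now - free shipping"),
]

recaptioned = recaption(web_pairs, toy_captioner)
for before, after in zip(web_pairs, recaptioned):
    print(f"{before.image_id}: {before.caption!r} -> {after.caption!r}")
```

The recaptioned pairs would then replace the original pairs as training data; the images themselves are unchanged, so any difference in the trained model is attributable to the captions.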