A Picture is Worth a Thousand Words: Principled Recaptioning Improves Image Generation

Text-to-image diffusion models have achieved a remarkable leap in capabilities over the last few years, enabling high-quality and diverse synthesis of images from a textual prompt. However, even the most advanced models often struggle to precisely follow all of the directions in their prompts. The vast majority of these models are trained on datasets consisting of (image, caption) pairs where the images often come from the web and the captions are their HTML alternate text. A notable example is the LAION dataset, used by Stable Diffusion and other models. In this work we observe that these captions are often of low quality, and argue that this significantly affects the model's capability to understand nuanced semantics in the textual prompts. We show that by relabeling the corpus with a specialized automatic captioning model and training a text-to-image model on the recaptioned dataset, the model benefits substantially across the board. First, in overall image quality: e.g., FID 14.84 vs. the baseline of 17.87, and a 64.3% improvement in faithful image generation according to human evaluation. Second, in semantic alignment: e.g., semantic object accuracy 84.34 vs. 78.90, counting alignment errors 1.32 vs. 1.44, and positional alignment 62.42 vs. 57.60. We analyze various ways to relabel the corpus and provide evidence that this technique, which we call RECAP, both reduces the train-inference discrepancy and provides the model with more information per example, increasing sample efficiency and allowing the model to better understand the relations between captions and images.
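
The relabeling step described in the abstract can be sketched roughly as follows. This is only a minimal illustration, not the authors' pipeline: the `Example` record, the `recaption` helper, the `toy_captioner` stand-in, and the sample alt-text strings are all hypothetical, and a real setup would call a learned captioning model on actual image data before training the text-to-image model on the result.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class Example:
    image_id: str  # hypothetical stand-in for the actual image data
    caption: str   # caption currently paired with the image


def recaption(
    dataset: Iterable[Example],
    captioner: Callable[[str], str],
) -> List[Example]:
    """Return a new dataset whose captions are produced by `captioner`.

    Images are kept as-is; only the text side of each pair is replaced.
    """
    return [Example(ex.image_id, captioner(ex.image_id)) for ex in dataset]


def toy_captioner(image_id: str) -> str:
    # Assumption: a lookup table standing in for a learned captioning model.
    descriptions = {
        "img_001": "two brown dogs sitting to the left of a red bicycle",
        "img_002": "a white mug on a wooden table next to an open laptop",
    }
    return descriptions[image_id]


# Made-up examples of low-quality alt-text captions.
alt_text_dataset = [
    Example("img_001", "IMG_2041.jpg"),
    Example("img_002", "buy now - free shipping"),
]

recaptioned = recaption(alt_text_dataset, toy_captioner)

for before, after in zip(alt_text_dataset, recaptioned):
    print(f"{before.image_id}: {before.caption!r} -> {after.caption!r}")
```

The sketch keeps the original dataset untouched and returns a new list, so the alt-text and recaptioned variants can be compared side by side.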
