Recent advancements in text-to-image generation using diffusion models have significantly improved the quality of generated images and expanded the range of objects they can depict. However, ensuring that these models adhere closely to text prompts remains a considerable challenge, and the issue is particularly pronounced when generating photorealistic images of humans. Without substantial prompt engineering, models often produce unrealistic images and typically fail to incorporate the full extent of the prompt information. This limitation can be largely attributed to the captions accompanying the images used to train large-scale diffusion models, which typically prioritize contextual information over details of a person's appearance. In this paper, we address this issue by introducing a training-free pipeline that generates accurate appearance descriptions from images of people. We apply this method to produce approximately 250,000 captions for publicly available face datasets, and we then use these synthetic captions to fine-tune a text-to-image diffusion model. Our results demonstrate that this approach significantly improves the model's ability to generate high-quality, realistic human faces and enhances adherence to the given prompts compared to the baseline model. We release our synthetic captions, pretrained checkpoints, and training code.
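As a rough illustration of the captioning stage of such a training-free pipeline, the sketch below queries an off-the-shelf vision-language model for appearance-focused descriptions. The choice of BLIP-2 and the exact prompt wording are assumptions made for this example, not the paper's confirmed setup; the resulting captions would then feed a standard text-to-image fine-tuning loop.

```python
# Minimal sketch of training-free appearance captioning, assuming a pretrained
# VLM (BLIP-2 here; an illustrative choice, not necessarily the paper's model).
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b")

def caption_face(image_path: str) -> str:
    """Return a short, appearance-focused description of the pictured person."""
    image = Image.open(image_path).convert("RGB")
    # Steer the VLM toward appearance details (hair, eyes, age, expression)
    # rather than scene context; the prompt wording is a hypothetical example.
    prompt = "Question: Describe the person's facial appearance in detail. Answer:"
    inputs = processor(images=image, text=prompt, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=60)
    return processor.decode(output_ids[0], skip_special_tokens=True).strip()
```

Captions produced this way could be paired with their source images and passed to any standard text-to-image fine-tuning script (e.g., the usual denoising objective over image-caption pairs), which is the fine-tuning step the abstract describes.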