Recent advancements in text-to-image generation using diffusion models have significantly improved the quality of generated images and expanded the range of objects they can depict. However, ensuring that these models adhere closely to text prompts remains a considerable challenge, and the issue is particularly pronounced when generating photorealistic images of humans. Without substantial prompt engineering, models often produce unrealistic images and typically fail to incorporate the full extent of the prompt information. This limitation can be largely attributed to the captions accompanying the images used to train large-scale diffusion models, which typically prioritize contextual information over details of a person's appearance. In this paper, we address this issue by introducing a training-free pipeline that generates accurate appearance descriptions from images of people. We apply this method to produce approximately 250,000 captions for publicly available face datasets, and we then use these synthetic captions to fine-tune a text-to-image diffusion model. Our results demonstrate that this approach significantly improves the model's ability to generate high-quality, realistic human faces and enhances adherence to the given prompts compared to the baseline model. We release our synthetic captions, pretrained checkpoints, and training code.
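As a rough illustration of the captioning stage of such a training-free pipeline, the sketch below queries an off-the-shelf vision-language model for appearance-focused descriptions. The choice of BLIP-2 and the exact prompt wording are assumptions made for this example, not the paper's confirmed setup; the resulting captions would then feed a standard text-to-image fine-tuning loop.

```python
# Minimal sketch of training-free appearance captioning, assuming a pretrained
# VLM (BLIP-2 here; an illustrative choice, not necessarily the paper's model).
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b")

def caption_face(image_path: str) -> str:
    """Return a short, appearance-focused description of the pictured person."""
    image = Image.open(image_path).convert("RGB")
    # Steer the VLM toward appearance details (hair, eyes, age, expression)
    # rather than scene context; the prompt wording is a hypothetical example.
    prompt = "Question: Describe the person's facial appearance in detail. Answer:"
    inputs = processor(images=image, text=prompt, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=60)
    return processor.decode(output_ids[0], skip_special_tokens=True).strip()
```

Captions produced this way could be paired with their source images and passed to any standard text-to-image fine-tuning script (e.g., the usual denoising objective over image-caption pairs), which is the fine-tuning step the abstract describes.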