Existing text-to-image generation approaches have set high standards for photorealism and text-image correspondence, largely benefiting from web-scale text-image datasets, which can include up to 5 billion pairs. However, text-to-image generation models trained on domain-specific datasets, such as urban scenes, medical images, and faces, still suffer from low text-image correspondence due to the lack of text-image pairs. Additionally, collecting billions of text-image pairs for a specific domain can be time-consuming and costly. Thus, ensuring high text-image correspondence without relying on web-scale text-image datasets remains a challenging task. In this paper, we present a novel approach for enhancing text-image correspondence by leveraging available semantic layouts. Specifically, we propose a Gaussian-categorical diffusion process that simultaneously generates images and their corresponding layouts. Our experiments reveal that training the model to generate a semantic label for each pixel guides text-to-image generation models to be aware of the semantics of different image regions. We demonstrate that our approach achieves higher text-image correspondence than existing text-to-image generation approaches on the Multi-Modal CelebA-HQ and Cityscapes datasets, where text-image pairs are scarce. Code is available at https://pmh9960.github.io/research/GCDP
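To make the idea of a joint Gaussian-categorical process concrete, the following is a minimal sketch of one forward (noising) step applied to an image and its per-pixel label map together. It is not the authors' implementation. It assumes a standard DDPM-style Gaussian corruption for the image and a uniform-resampling categorical corruption for the layout, neither of which is specified in the abstract. The function name `joint_forward_sample` and the parameters `alpha_bar` and `num_classes` are hypothetical.

```python
import numpy as np

def joint_forward_sample(image, layout, alpha_bar, num_classes, rng):
    """Noise an (image, layout) pair at a single timestep (illustrative only).

    image:       float array of shape (H, W, C)
    layout:      int array of shape (H, W), labels in [0, num_classes)
    alpha_bar:   assumed cumulative signal-retention factor in (0, 1]
    num_classes: number of semantic classes K
    rng:         a numpy.random.Generator
    """
    # Gaussian branch (assumed DDPM-style): scale the image and add noise.
    eps = rng.standard_normal(image.shape)
    noisy_image = np.sqrt(alpha_bar) * image + np.sqrt(1.0 - alpha_bar) * eps

    # Categorical branch (assumed uniform transitions): each pixel's class
    # distribution is alpha_bar * one_hot(label) + (1 - alpha_bar) / K.
    one_hot = np.eye(num_classes)[layout]                # (H, W, K)
    probs = alpha_bar * one_hot + (1.0 - alpha_bar) / num_classes

    # Draw one label per pixel by inverse-CDF sampling.
    cdf = np.cumsum(probs, axis=-1)
    u = rng.random(layout.shape + (1,))                  # uniform in [0, 1)
    noisy_layout = (u >= cdf).sum(axis=-1)
    # Guard against floating-point rounding in the last CDF entry.
    noisy_layout = np.minimum(noisy_layout, num_classes - 1)

    return noisy_image, noisy_layout
```

Under these assumptions, a single denoising network would receive both `noisy_image` and `noisy_layout` and be trained to recover the clean pair, which is one way a model could be pushed to predict a semantic label for every pixel alongside the image.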