Text-to-image synthesis aims to generate visually realistic and semantically consistent images from given textual descriptions. Previous approaches generate an initial low-resolution image and then refine it to high resolution. Despite remarkable progress, these methods do not fully utilize the given text and can generate text-mismatched images, especially when the description is complex. We propose a novel Fine-grained text-image Fusion based Generative Adversarial Networks, dubbed FF-GAN, which consists of two modules: the Fine-grained text-image Fusion Block (FF-Block) and Global Semantic Refinement (GSR). The FF-Block integrates an attention block and several convolution layers to fuse fine-grained word-context features into the corresponding visual features, so that the text information is fully used to refine the initial image with more details. GSR improves the global semantic consistency between linguistic and visual features during the refinement process. Extensive experiments on the CUB-200 and COCO datasets demonstrate the superiority of FF-GAN over other state-of-the-art approaches in generating images that are semantically consistent with the given texts. Code is available at https://github.com/haoranhfut/FF-GAN.
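As a rough illustration of the word-level fusion idea described above, the following pure-Python sketch attends from each visual region to the word features and adds the resulting word context back onto the region. It is a minimal sketch under stated assumptions, not the FF-Block itself: the names `softmax` and `word_attention_fuse` are hypothetical, dot-product attention is assumed, and a plain residual addition stands in for the convolution layers, whose configuration the abstract does not specify.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def word_attention_fuse(regions, words):
    """Fuse word features into region features (hypothetical helper).

    regions: list of N visual feature vectors, each of dimension D
    words:   list of T word feature vectors, each of dimension D
    Returns N fused vectors of dimension D.
    """
    fused = []
    for r in regions:
        # Similarity between this region and every word (dot product).
        scores = [sum(a * b for a, b in zip(r, w)) for w in words]
        weights = softmax(scores)
        # Word-context vector: attention-weighted sum of word features.
        context = [
            sum(wt * w[d] for wt, w in zip(weights, words))
            for d in range(len(r))
        ]
        # Assumption: residual addition replaces the convolutional fusion.
        fused.append([rv + cv for rv, cv in zip(r, context)])
    return fused
```

With one-hot-like inputs, each region is pulled toward the word it matches best, which is the behavior the fine-grained fusion is meant to exploit; a global sentence-level consistency term such as GSR is not modeled here.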