Text-to-image synthesis aims to generate visually realistic and semantically consistent images from textual descriptions. Previous approaches generate an initial low-resolution image and then refine it to high resolution. Despite remarkable progress, these methods do not fully exploit the given text and can generate images that do not match it, especially when the description is complex. We propose a novel Fine-grained text-image Fusion based Generative Adversarial Networks, dubbed FF-GAN, which consists of two modules: the Fine-grained text-image Fusion Block (FF-Block) and Global Semantic Refinement (GSR). The FF-Block integrates an attention block and several convolution layers to fuse fine-grained word-context features into the corresponding visual features, so that the text information is fully used to refine the initial image with more details. GSR improves the global semantic consistency between linguistic and visual features during the refinement process. Extensive experiments on the CUB-200 and COCO datasets demonstrate the superiority of FF-GAN over other state-of-the-art approaches in generating images that are semantically consistent with the given texts. Code is available at https://github.com/haoranhfut/FF-GAN.
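To make the idea of fusing word-context features into visual features concrete, the following is a minimal sketch of word-level attention over image regions. It is not the authors' FF-Block: the function names (`attend`, `fuse`), the dot-product scoring, and the fusion by concatenation are assumptions for illustration, and the convolution layers mentioned in the abstract are omitted.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(region, words):
    """Return attention weights over words and the word-context vector
    for one image region (dot-product scoring is an assumption)."""
    scores = [sum(r * w for r, w in zip(region, word)) for word in words]
    weights = softmax(scores)
    dim = len(region)
    context = [
        sum(wt * word[d] for wt, word in zip(weights, words))
        for d in range(dim)
    ]
    return weights, context


def fuse(regions, words):
    """Concatenate each region feature with its word-context vector.
    A real block would follow this with learned convolution layers."""
    fused = []
    for region in regions:
        _, context = attend(region, words)
        fused.append(region + context)
    return fused


# Hypothetical toy inputs: 2 image regions and 3 words, all of dimension 2.
regions = [[1.0, 0.0], [0.0, 1.0]]
words = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
for row in fuse(regions, words):
    print(row)
```

Each region ends up with a context vector weighted toward the words it most resembles, which is the sense in which fine-grained text information can refine individual parts of the image rather than the image as a whole.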