One of the major challenges in training deep neural networks for text-to-image generation is the significant linguistic discrepancy among the ground-truth captions of each image in most popular datasets. The large difference in word choice across such captions leads to synthesized images that are semantically dissimilar to each other and to their ground-truth counterparts. Moreover, existing models either fail to generate the fine-grained details of the image or require a huge number of parameters, which renders them inefficient for text-to-image synthesis. To fill this gap in the literature, we propose a contrastive learning approach with a novel combination of two loss functions: a fake-to-fake loss, which increases the semantic consistency between images generated from the same caption, and a fake-to-real loss, which reduces the gap between the distributions of real and fake images. We test this approach on two baseline models: SSAGAN and AttnGAN (with style blocks to enhance the fine-grained details of the images). Results show that our approach improves the qualitative results of AttnGAN with style blocks on the CUB dataset. Additionally, on the challenging COCO dataset, our approach achieves competitive results against the state-of-the-art Lafite model and outperforms the FID score of the SSAGAN model by 44.
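The abstract does not give the exact form of the two losses, so the following is only a minimal sketch of how such a pair of contrastive terms might look, assuming an InfoNCE-style objective over image feature vectors with cosine similarity. The function names (`cosine`, `info_nce`, `contrastive_terms`), the temperature value, and the choice of InfoNCE itself are illustrative assumptions, not the paper's stated formulation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two feature vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(anchors, positives, temperature=0.1):
    """Mean InfoNCE loss where anchors[i] is paired with positives[i].

    All other entries of `positives` act as negatives for anchor i.
    (Assumed formulation; the temperature is a hypothetical default.)
    """
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, p) / temperature for p in positives]
        # Numerically stable log-sum-exp over all candidates.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        # Negative log-probability of picking the matching pair.
        total += log_denom - logits[i]
    return total / len(anchors)


def contrastive_terms(fake_a, fake_b, real, temperature=0.1):
    """Return (fake_to_fake, fake_to_real) under the assumptions above.

    fake_a[i], fake_b[i]: features of two images generated from caption i
                          (for example, with different noise vectors).
    real[i]:              features of the real image for caption i.
    """
    fake_to_fake = info_nce(fake_a, fake_b, temperature)
    fake_to_real = info_nce(fake_a, real, temperature)
    return fake_to_fake, fake_to_real
```

In this sketch, the fake-to-fake term is small when two images generated from the same caption have similar features, and the fake-to-real term is small when each generated image is closer to its own real counterpart than to the other real images in the batch. How these terms would be weighted against the adversarial losses of SSAGAN or AttnGAN is not stated in the abstract and is left open here.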