We propose a text-to-image generation algorithm based on deep neural networks for the setting in which text captions for images are unavailable during training. Instead of simply generating pseudo-ground-truth sentences for training images with existing image captioning methods, we employ a pretrained CLIP model, which aligns the embeddings of images and their corresponding texts in a joint space and, consequently, performs well on zero-shot recognition tasks. We optimize the text-to-image generation model by maximizing the data log-likelihood conditioned on pairs of image and text CLIP embeddings. To better align data in the two domains, we adopt a principled approach based on variational inference, which efficiently estimates an approximate posterior of the hidden text embedding given an image and its CLIP feature. Experimental results show that the proposed framework outperforms existing approaches by large margins in both unsupervised and semi-supervised text-to-image generation settings.
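As a reading aid, the variational objective described in the abstract can be sketched in the generic form of a conditional evidence lower bound. The notation here is an assumption introduced for illustration and does not come from the abstract: $x$ is a training image, $z_{\mathrm{img}}$ its CLIP image embedding, $z_{\mathrm{txt}}$ the unobserved CLIP text embedding, $p_\theta$ the generator, and $q_\phi$ the approximate posterior. The exact objective and factorization used in the paper may differ.

\[
\log p_\theta(x \mid z_{\mathrm{img}})
= \log \int p_\theta(x \mid z_{\mathrm{img}}, z_{\mathrm{txt}})\, p(z_{\mathrm{txt}} \mid z_{\mathrm{img}})\, dz_{\mathrm{txt}}
\;\ge\;
\mathbb{E}_{q_\phi(z_{\mathrm{txt}} \mid x,\, z_{\mathrm{img}})}\!\left[ \log p_\theta(x \mid z_{\mathrm{img}}, z_{\mathrm{txt}}) \right]
- \mathrm{KL}\!\left( q_\phi(z_{\mathrm{txt}} \mid x, z_{\mathrm{img}}) \,\middle\|\, p(z_{\mathrm{txt}} \mid z_{\mathrm{img}}) \right)
\]

Under these assumptions, the first term rewards generating the image from the embedding pair, and the KL term keeps the inferred text embedding close to a prior conditioned on the image embedding. The inequality follows from Jensen's inequality and is tight when $q_\phi$ equals the true posterior over $z_{\mathrm{txt}}$.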