Text-to-image synthesis has made encouraging progress and attracted considerable public attention recently. However, popular evaluation metrics in this area, such as the Inception Score and Fréchet Inception Distance, suffer from several issues. First, they cannot explicitly assess the perceptual quality of generated images, and they poorly reflect the semantic alignment of each text-image pair. Second, they are inefficient and need to sample thousands of images to stabilise their evaluation results. In this paper, we propose to evaluate text-to-image generation performance by directly estimating the likelihood of the generated images using a pre-trained likelihood-based text-to-image generative model, where a higher likelihood indicates better perceptual quality and better text-image alignment. To prevent the likelihood from being dominated by the non-crucial parts of the generated image, we propose several new designs to develop a credit assignment strategy based on the semantic and perceptual significance of the image patches. In the experiments, we evaluate the proposed metric on multiple popular text-to-image generation models and datasets in assessing both the perceptual quality and the text-image alignment. Moreover, it can successfully assess the generation ability of these models with as few as a hundred samples, making it very efficient in practice.
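To make the idea of likelihood with credit assignment concrete, the following is a minimal sketch, not the method proposed in the paper. It assumes that a pre-trained likelihood-based model has already produced one log-probability per image patch and that some significance score is available for each patch; the function name `weighted_log_likelihood`, its inputs, and the toy numbers are all hypothetical.

```python
import math
from typing import Sequence


def weighted_log_likelihood(
    patch_log_probs: Sequence[float],
    weights: Sequence[float],
) -> float:
    """Significance-weighted average of per-patch log-likelihoods.

    patch_log_probs[i]: log-probability of patch i given the text, as
        produced by some pre-trained model (hypothetical input).
    weights[i]: non-negative significance of patch i (hypothetical; the
        paper's actual credit assignment scheme is not reproduced here).
    """
    if len(patch_log_probs) != len(weights):
        raise ValueError("need exactly one weight per patch")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    weighted_sum = sum(w * lp for w, lp in zip(weights, patch_log_probs))
    return weighted_sum / total


# Toy numbers: four patches; the last two stand in for hard-to-predict,
# low-significance background regions.
log_probs = [math.log(0.5), math.log(0.25), math.log(0.01), math.log(0.01)]
uniform = weighted_log_likelihood(log_probs, [1, 1, 1, 1])
focused = weighted_log_likelihood(log_probs, [1, 1, 0, 0])
```

With uniform weights the two background patches pull the score down sharply; with the weights concentrated on the first two patches the score reflects only those regions, which is the effect a credit assignment strategy is meant to achieve.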