As with many machine learning problems, progress in image generation hinges on good evaluation metrics. One of the most popular is the Fréchet Inception Distance (FID), which estimates the distance between the distribution of Inception-v3 features of real images and that of images generated by the algorithm. We highlight important drawbacks of FID: Inception's poor representation of the rich and varied content generated by modern text-to-image models, incorrect normality assumptions, and poor sample complexity. We call for a reevaluation of FID's use as the primary quality metric for generated images. We empirically demonstrate that FID contradicts human raters, that it does not reflect the gradual improvement of iterative text-to-image models, that it does not capture distortion levels, and that it produces inconsistent results when the sample size varies. We also propose a new alternative metric, CMMD, based on richer CLIP embeddings and the maximum mean discrepancy (MMD) distance with the Gaussian RBF kernel. It is an unbiased estimator that makes no assumptions about the probability distribution of the embeddings and is sample efficient. Through extensive experiments and analysis, we demonstrate that FID-based evaluations of text-to-image models may be unreliable, and that CMMD offers a more robust and reliable assessment of image quality.
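To make the quantity behind CMMD concrete, the following is a minimal sketch of the standard unbiased estimator of the squared MMD with a Gaussian RBF kernel, computed between two sets of embedding vectors. It is an illustration under stated assumptions, not the authors' implementation: the function names `rbf_kernel` and `mmd2_unbiased` and the bandwidth `sigma` are hypothetical choices, and in practice the inputs would be CLIP embeddings of real and generated images rather than arbitrary arrays.

```python
import numpy as np


def rbf_kernel(a, b, sigma):
    """Gaussian RBF kernel matrix k(a_i, b_j) = exp(-||a_i - b_j||^2 / (2 sigma^2))."""
    # Pairwise squared Euclidean distances via the expansion ||a||^2 + ||b||^2 - 2 a.b
    sq_dists = (a ** 2).sum(axis=1)[:, None] + (b ** 2).sum(axis=1)[None, :] - 2.0 * a @ b.T
    sq_dists = np.maximum(sq_dists, 0.0)  # clip tiny negatives caused by round-off
    return np.exp(-sq_dists / (2.0 * sigma ** 2))


def mmd2_unbiased(x, y, sigma=10.0):
    """Unbiased estimate of squared MMD between samples x (m, d) and y (n, d).

    `sigma` is an assumed bandwidth for illustration only, not a value
    taken from the paper.
    """
    m, n = len(x), len(y)
    k_xx = rbf_kernel(x, x, sigma)
    k_yy = rbf_kernel(y, y, sigma)
    k_xy = rbf_kernel(x, y, sigma)
    # Drop the diagonal (i == j) terms so the within-set averages are unbiased.
    term_xx = (k_xx.sum() - np.trace(k_xx)) / (m * (m - 1))
    term_yy = (k_yy.sum() - np.trace(k_yy)) / (n * (n - 1))
    term_xy = k_xy.mean()
    return term_xx + term_yy - 2.0 * term_xy
```

Because the within-set sums exclude the diagonal, the estimate can be slightly negative when the two samples come from the same distribution; that is expected behavior for an unbiased estimator rather than a bug. Unlike FID, nothing in this computation fits a Gaussian to the embeddings.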