Fréchet Inception Distance (FID) is the primary metric for ranking models in data-driven generative modeling. While remarkably successful, the metric is known to sometimes disagree with human judgement. We investigate a root cause of these discrepancies, and visualize what FID "looks at" in generated images. We show that the feature space in which FID is typically computed is so close to the ImageNet classifications that aligning the histograms of Top-$N$ classifications between sets of generated and real images can reduce FID substantially, without actually improving the quality of the results. Thus, we conclude that FID is prone to intentional or accidental distortions. As a practical example of an accidental distortion, we discuss a case where an ImageNet pre-trained FastGAN achieves an FID comparable to that of StyleGAN2, while being worse in terms of human evaluation.
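
For reference, FID in its standard definition is the Fréchet distance between two Gaussians fitted to feature vectors of real and generated images: $\lVert \mu_r - \mu_g \rVert^2 + \mathrm{Tr}\big(\Sigma_r + \Sigma_g - 2(\Sigma_r \Sigma_g)^{1/2}\big)$. The following is a minimal sketch of that computation, not the evaluation code used in this work. It assumes the features have already been extracted (typically from an Inception-V3 network) into arrays of shape `(n_samples, dim)`; the function name `frechet_distance` and its arguments are hypothetical.

```python
import numpy as np


def frechet_distance(feats_real, feats_gen):
    """Fréchet distance between Gaussians fitted to two feature sets.

    Both inputs are assumed to be precomputed feature arrays of shape
    (n_samples, dim); feature extraction is outside this sketch.
    """
    mu_r = feats_real.mean(axis=0)
    mu_g = feats_gen.mean(axis=0)
    sigma_r = np.atleast_2d(np.cov(feats_real, rowvar=False))
    sigma_g = np.atleast_2d(np.cov(feats_gen, rowvar=False))

    # Tr((sigma_r sigma_g)^{1/2}) equals the sum of the square roots of the
    # eigenvalues of sigma_r @ sigma_g, which are real and non-negative for
    # positive semi-definite covariances. Small negative or imaginary parts
    # caused by round-off are discarded.
    eigvals = np.linalg.eigvals(sigma_r @ sigma_g)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()

    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(sigma_r) + np.trace(sigma_g)
                 - 2.0 * trace_sqrt)
```

Because the distance depends only on the mean and covariance of the features, any change to the generated set that shifts these statistics toward those of the real set lowers the score, whether or not the images look better.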