Perceptual metrics, such as the Fr\'echet Inception Distance (FID), are widely used to assess the similarity between synthetically generated and ground-truth (real) images. The key idea behind these metrics is to compute errors in a deep feature space that captures perceptually and semantically rich image features. Despite their popularity, the effect that different deep features and their design choices have on a perceptual metric has not been well studied. In this work, we perform a causal analysis linking differences in semantic attributes and distortions between face image distributions to Fr\'echet distances (FD) using several popular deep feature spaces. A key component of our analysis is the creation of synthetic counterfactual faces using deep face generators. Our experiments show that the FD is heavily influenced by its feature space's training dataset and objective function. For example, the FD computed with features extracted from ImageNet-trained models heavily emphasizes hats over regions like the eyes and mouth. Moreover, the FD computed with features from a face gender classifier emphasizes hair length more than distances in an identity (recognition) feature space do. Finally, we evaluate several popular face generation models across feature spaces and find that StyleGAN2 consistently ranks higher than other face generators, except with respect to identity (recognition) features. This suggests the need to consider multiple feature spaces when evaluating generative models and to use feature spaces that are tuned to the nuances of the domain of interest.
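For reference, the FD between Gaussians fitted to two sets of deep features is commonly written as ||mu_a - mu_b||^2 + Tr(Sigma_a + Sigma_b - 2 (Sigma_a Sigma_b)^{1/2}). The following is a minimal NumPy sketch of that formula, not the implementation used in this work. The function name `frechet_distance` and the random feature arrays are hypothetical, and the sketch assumes features have already been extracted by some network (ImageNet-trained, identity, or otherwise).

```python
import numpy as np


def frechet_distance(feats_a, feats_b):
    """Frechet distance between Gaussians fitted to two feature sets.

    feats_a, feats_b: arrays of shape (n_samples, n_features), assumed to be
    pre-extracted deep features (hypothetical inputs for illustration).
    """
    feats_a = np.asarray(feats_a, dtype=np.float64)
    feats_b = np.asarray(feats_b, dtype=np.float64)

    # Fit a Gaussian (mean and covariance) to each feature set.
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.atleast_2d(np.cov(feats_a, rowvar=False))
    cov_b = np.atleast_2d(np.cov(feats_b, rowvar=False))

    # Tr((cov_a cov_b)^{1/2}) equals the sum of the square roots of the
    # eigenvalues of cov_a @ cov_b, which are real and non-negative for
    # positive semi-definite inputs. Clip tiny negatives from round-off.
    eigvals = np.linalg.eigvals(cov_a @ cov_b).real
    trace_sqrt = np.sqrt(np.clip(eigvals, 0.0, None)).sum()

    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a) + np.trace(cov_b) - 2.0 * trace_sqrt)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    real = rng.normal(size=(500, 8))   # stand-in for features of real images
    fake = real + 0.5                  # same covariance, mean shifted by 0.5
    # With identical covariances the FD reduces to the squared mean shift,
    # so this should be close to 8 * 0.5**2 = 2.0.
    print(frechet_distance(real, fake))
```

Because the distance depends only on the feature means and covariances, swapping the feature extractor changes which image differences move those statistics, which is the sensitivity the analysis above investigates.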