Perceptual metrics, like the Fréchet Inception Distance (FID), are widely used to assess the similarity between synthetically generated and ground-truth (real) images. The key idea behind these metrics is to compute errors in a deep feature space that captures perceptually and semantically rich image features. Despite their popularity, the effect that different deep features and their design choices have on a perceptual metric has not been well studied. In this work, we perform a causal analysis linking differences in semantic attributes and distortions between face image distributions to Fréchet distances (FD) using several popular deep feature spaces. A key component of our analysis is the creation of synthetic counterfactual faces using deep face generators. Our experiments show that the FD is heavily influenced by its feature space's training dataset and objective function. For example, FD using features extracted from ImageNet-trained models heavily emphasizes hats over regions like the eyes and mouth. Moreover, FD using features from a face gender classifier emphasizes hair length more than FD computed in an identity (recognition) feature space does. Finally, we evaluate several popular face generation models across feature spaces and find that StyleGAN2 consistently ranks higher than other face generators, except with respect to identity (recognition) features. This suggests the need for considering multiple feature spaces when evaluating generative models and using feature spaces that are tuned to nuances of the domain of interest.
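The abstract does not spell out how an FD is computed, so the following is only a minimal sketch, assuming the standard Gaussian form used by FID: each feature set is summarized by its mean and covariance, and the distance is $\lVert \mu_a - \mu_b \rVert^2 + \mathrm{Tr}\big(\Sigma_a + \Sigma_b - 2(\Sigma_a \Sigma_b)^{1/2}\big)$. The function name `frechet_distance` and the random arrays standing in for deep features are hypothetical; in practice the features would come from whichever network defines the feature space (an ImageNet classifier, a face recognition model, and so on).

```python
import numpy as np


def frechet_distance(feats_a, feats_b):
    """Fréchet distance between Gaussians fitted to two feature sets.

    feats_a, feats_b: arrays of shape (num_images, feature_dim).
    The feature extractor is left out on purpose; any deep feature
    space can be plugged in (assumption of this sketch).
    """
    mu_a = feats_a.mean(axis=0)
    mu_b = feats_b.mean(axis=0)
    cov_a = np.atleast_2d(np.cov(feats_a, rowvar=False))
    cov_b = np.atleast_2d(np.cov(feats_b, rowvar=False))

    # Tr((cov_a cov_b)^{1/2}) equals the sum of the square roots of the
    # eigenvalues of cov_a @ cov_b. For positive semi-definite covariances
    # these eigenvalues are real and non-negative, so tiny negative or
    # imaginary parts are treated as numerical noise and discarded.
    eigvals = np.linalg.eigvals(cov_a @ cov_b)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()

    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a) + np.trace(cov_b) - 2.0 * trace_sqrt)


# Stand-in "features": 500 samples in an 8-dimensional space.
rng = np.random.default_rng(0)
real = rng.normal(size=(500, 8))
shifted = real + 1.0  # same covariance, mean moved by 1 in every dimension

fd_same = frechet_distance(real, real)
fd_shifted = frechet_distance(real, shifted)
print(fd_same, fd_shifted)
```

Because `shifted` differs from `real` only by a unit offset in each of the 8 dimensions, the covariance term cancels and the distance reduces to the squared mean difference, roughly 8. Swapping the feature extractor changes which image differences move the means and covariances, which is the dependence the analysis above examines.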