Advances such as Generative Adversarial Networks have drawn researchers toward face image synthesis, with the goal of generating ever more realistic images. Consequently, the need for evaluation criteria that assess the realism of generated images has become apparent. While FID computed with InceptionV3 features is one of the primary choices for benchmarking, concerns have emerged about InceptionV3's limitations on face images. This study investigates the behavior of diverse feature extractors -- InceptionV3, CLIP, DINOv2, and ArcFace -- across a variety of metrics (FID, KID, and Precision\&Recall). The FFHQ dataset serves as the target domain, while the CelebA-HQ dataset and synthetic datasets generated with StyleGAN2 and Projected FastGAN serve as the source domains. The experiments include an in-depth analysis of the features: $L_2$ normalization, model attention during extraction, and domain distributions in the feature space. We aim to provide insights into the behavior of feature extractors for evaluating face image synthesis methodologies. The code is publicly available at https://github.com/ThEnded32/AnalyzingFeatureExtractors.
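
To make the quantities above concrete, the following is a minimal sketch of how a Fréchet distance between two feature sets can be computed, with and without $L_2$ normalization of the features. It is not the implementation from the linked repository: the function names `l2_normalize` and `frechet_distance` are hypothetical, and the random arrays are stand-ins for features that would in practice come from an extractor such as InceptionV3, CLIP, DINOv2, or ArcFace.

```python
import numpy as np


def l2_normalize(features, eps=1e-12):
    """Scale each feature vector (row) to unit Euclidean length."""
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    return features / np.maximum(norms, eps)


def frechet_distance(feats_a, feats_b):
    """Frechet distance between Gaussians fitted to two feature sets."""
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    # tr((cov_a cov_b)^(1/2)) equals the sum of the square roots of the
    # eigenvalues of cov_a @ cov_b. For positive semi-definite inputs these
    # are real and non-negative up to numerical error, hence real part + clip.
    eigvals = np.linalg.eigvals(cov_a @ cov_b)
    tr_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a) + np.trace(cov_b) - 2.0 * tr_sqrt)


# Assumption: random stand-ins for extracted features (500 samples, 16 dims).
rng = np.random.default_rng(0)
real_feats = rng.normal(size=(500, 16))
fake_feats = rng.normal(loc=0.5, size=(500, 16))

fid_raw = frechet_distance(real_feats, fake_feats)
fid_norm = frechet_distance(l2_normalize(real_feats), l2_normalize(fake_feats))
```

Because normalization places every feature vector on the unit sphere, the two scores are on different scales and are not directly comparable in magnitude. This is one reason the effect of $L_2$ normalization is worth examining separately for each extractor.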