A good metric, one that promises a reliable comparison between solutions, is essential to a well-defined task. Unlike most vision tasks, which have per-sample ground truth, image synthesis aims to generate \emph{unseen} data and is therefore usually evaluated with a distributional distance between a set of real samples and a set of generated samples. This work presents an empirical study of how synthesis performance is evaluated, taking the popular generative adversarial networks (GANs) as representative generative models. In particular, we analyze in depth how to represent a data point in the feature space, how to calculate a fair distance using selected samples, and how many instances to use from each set. Experiments on multiple datasets and settings suggest that (1) a group of models, including both CNN-based and ViT-based architectures, serve as reliable and robust feature extractors, (2) Centered Kernel Alignment (CKA) enables better comparison across various extractors and across hierarchical layers within one model, and (3) CKA shows satisfactory sample efficiency and complements existing metrics (\textit{e.g.}, FID) in characterizing the similarity between two internal data correlations. These findings help us design a new measurement system, with which we re-evaluate state-of-the-art generative models in a consistent and reliable way.
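To make the CKA comparison concrete, the following is a minimal sketch of the linear-kernel form of Centered Kernel Alignment, written with NumPy. It is an illustration under stated assumptions, not the measurement system described here: the abstract does not say which kernel is used or how real and generated samples are paired, so the choice of a linear kernel, the function name `linear_cka`, and the requirement that both feature matrices have the same number of rows are all assumptions made for this example.

```python
import numpy as np


def linear_cka(x, y):
    """Linear CKA between feature matrices x (n, d1) and y (n, d2).

    Assumption: row i of x and row i of y are matched, so both
    matrices must contain the same number of rows n.
    """
    x = np.asarray(x, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)
    if x.shape[0] != y.shape[0]:
        raise ValueError("x and y must have the same number of rows")

    # Center every feature dimension so the score ignores mean offsets.
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)

    # Squared Frobenius norm of the cross-covariance (unnormalized).
    cross = np.linalg.norm(y.T @ x, ord="fro") ** 2
    # Frobenius norms of the two self-covariances (unnormalized).
    norm_x = np.linalg.norm(x.T @ x, ord="fro")
    norm_y = np.linalg.norm(y.T @ y, ord="fro")
    return float(cross / (norm_x * norm_y))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feats_a = rng.standard_normal((200, 16))  # hypothetical extractor A
    feats_b = rng.standard_normal((200, 32))  # hypothetical extractor B
    print("CKA(a, a):", linear_cka(feats_a, feats_a))
    print("CKA(a, b):", linear_cka(feats_a, feats_b))
```

The score lies in [0, 1] and is unchanged by orthogonal transforms and isotropic rescaling of either feature matrix, which is one reason a CKA-style score can be compared across extractors whose features differ in dimension and scale.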