A good metric, one that supports reliable comparison between solutions, is essential for any well-defined task. Unlike most vision tasks, which have per-sample ground truth, image synthesis aims to generate unseen data and is therefore usually evaluated through a distributional distance between a set of real samples and a set of generated samples. This study presents an empirical investigation into the evaluation of synthesis performance, using generative adversarial networks (GANs) as representative generative models. In particular, we analyze in depth several factors: how to represent a data point in the representation space, how to calculate a fair distance using selected samples, and how many instances to use from each set. Extensive experiments on multiple datasets and settings reveal several important findings. First, a group of models that includes both CNN-based and ViT-based architectures serves as a set of reliable and robust feature extractors for evaluation. Second, Centered Kernel Alignment (CKA) provides a better comparison across various extractors and across hierarchical layers within one model. Finally, CKA is more sample-efficient and agrees better with human judgment in characterizing the similarity between two internal data correlations. These findings contribute to the development of a new measurement system, which enables a consistent and reliable re-evaluation of current state-of-the-art generative models.
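To make the CKA comparison concrete, here is a minimal sketch of how a CKA score between a set of real features and a set of generated features could be computed. It is an illustration under stated assumptions, not the measurement system described in the abstract: the helper names (`center_gram`, `feature_gram`, `cka_between_sets`) are hypothetical, the features are assumed to come from some pretrained extractor as `(n_samples, dim)` arrays, and the choice to compare `dim x dim` correlation matrices (so that the two sets need not be paired or equal in size) is an assumption about how "internal data correlations" might be realized.

```python
import numpy as np


def center_gram(gram):
    """Double-center a square Gram matrix: H K H with H = I - (1/n) 1 1^T."""
    n = gram.shape[0]
    h = np.eye(n) - np.ones((n, n)) / n
    return h @ gram @ h


def cka_from_grams(k, l):
    """CKA between two Gram matrices of identical shape.

    Normalized HSIC: <Kc, Lc>_F / (||Kc||_F * ||Lc||_F).
    """
    kc, lc = center_gram(k), center_gram(l)
    hsic_kl = np.sum(kc * lc)
    norm = np.sqrt(np.sum(kc * kc) * np.sum(lc * lc))
    return float(hsic_kl / norm)


def feature_gram(features):
    """dim x dim correlation structure of one sample set.

    Assumption: each set is summarized by the covariance of its
    mean-centered features, so sets of different sizes stay comparable.
    """
    x = features - features.mean(axis=0, keepdims=True)
    return x.T @ x / x.shape[0]


def cka_between_sets(real_feats, fake_feats):
    """Similarity between two unpaired feature sets with the same dim."""
    return cka_from_grams(feature_gram(real_feats), feature_gram(fake_feats))


# Toy usage with synthetic "features" (stand-ins for extractor outputs).
rng = np.random.default_rng(0)
dim = 16
mix_a = rng.normal(size=(dim, dim))   # correlation structure of the "real" data
mix_b = rng.normal(size=(dim, dim))   # a different correlation structure

real = rng.normal(size=(500, dim)) @ mix_a
fake_close = rng.normal(size=(400, dim)) @ mix_a  # same structure, new samples
fake_far = rng.normal(size=(400, dim)) @ mix_b    # different structure

print("close:", cka_between_sets(real, fake_close))
print("far:  ", cka_between_sets(real, fake_far))
```

In this sketch a higher score means more similar correlation structure, with 1 reached when the two centered matrices are identical up to scale. Because the score is normalized, rescaling all features of one set by a constant leaves it unchanged, which is one reason a normalized measure can be compared across extractors and layers whose activations differ in magnitude.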