We systematically study a wide variety of generative models spanning semantically diverse image datasets to understand and improve the feature extractors and metrics used to evaluate them. Using best practices in psychophysics, we measure human perception of image realism for generated samples by conducting the largest experiment evaluating generative models to date, and find that no existing metric strongly correlates with human evaluations. Comparing against 17 modern metrics for evaluating the overall performance, fidelity, diversity, rarity, and memorization of generative models, we find that the state-of-the-art perceptual realism of diffusion models as judged by humans is not reflected in commonly reported metrics such as FID. This discrepancy is not explained by diversity in generated samples, though one cause is over-reliance on Inception-V3. We address these flaws through a study of alternative self-supervised feature extractors, find that the semantic information encoded by individual networks strongly depends on their training procedure, and show that DINOv2-ViT-L/14 allows for much richer evaluation of generative models. Next, we investigate data memorization, and find that generative models do memorize training examples on simple, smaller datasets like CIFAR10, but not necessarily on more complex datasets like ImageNet. However, our experiments show that current metrics do not properly detect memorization: none in the literature is able to separate memorization from other phenomena such as underfitting or mode shrinkage. To facilitate further development of generative models and their evaluation, we release all generated image datasets, human evaluation data, and a modular library to compute 17 common metrics for 9 different encoders at https://github.com/layer6ai-labs/dgm-eval.
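For readers unfamiliar with how FID-style metrics depend on the feature extractor, the following is a minimal sketch of the Fréchet distance between Gaussians fitted to two sets of feature vectors. It is not the API of the released dgm-eval library: the function name `frechet_distance` and the synthetic arrays are hypothetical, and random Gaussian vectors stand in for the embeddings that an encoder such as Inception-V3 or DINOv2-ViT-L/14 would produce.

```python
import numpy as np


def frechet_distance(feats_real, feats_gen):
    """Squared Frechet distance between Gaussians fitted to two feature sets.

    Both inputs have shape (n_samples, n_features). Hypothetical helper,
    not part of any released library.
    """
    feats_real = np.asarray(feats_real, dtype=np.float64)
    feats_gen = np.asarray(feats_gen, dtype=np.float64)

    # Fit a Gaussian (mean and covariance) to each feature set.
    mu_r, mu_g = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    cov_r = np.atleast_2d(np.cov(feats_real, rowvar=False))
    cov_g = np.atleast_2d(np.cov(feats_gen, rowvar=False))

    # tr((cov_r cov_g)^(1/2)) equals the sum of the square roots of the
    # eigenvalues of cov_r @ cov_g, which are real and non-negative when
    # both covariances are positive semi-definite.
    eigvals = np.linalg.eigvals(cov_r @ cov_g)
    tr_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()

    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r) + np.trace(cov_g) - 2.0 * tr_sqrt)


# Synthetic stand-ins for encoder features (assumption: 8-dimensional).
rng = np.random.default_rng(0)
real = rng.normal(0.0, 1.0, size=(2000, 8))
same = rng.normal(0.0, 1.0, size=(2000, 8))     # same distribution as `real`
shifted = rng.normal(0.5, 1.0, size=(2000, 8))  # mean shifted by 0.5 per dim

fd_same = frechet_distance(real, same)
fd_shifted = frechet_distance(real, shifted)
print(fd_same, fd_shifted)
```

Because the distance only sees the mean and covariance of the features, any aspect of the images that the encoder discards is invisible to the metric. This is one way to read the abstract's point that the choice of feature extractor shapes what the evaluation can detect.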