Exposing flaws of generative model evaluation metrics and their unfair treatment of diffusion models

George Stein, Jesse C. Cresswell, Rasa Hosseinzadeh, Yi Sui, Brendan Leigh Ross, Valentin Villecroze, Zhaoyan Liu, Anthony L. Caterini, J. Eric T. Taylor, Gabriel Loaiza-Ganem

From arXiv; NeurIPS 2023. 53 pages, 29 figures, 12 tables. Code at https://github.com/layer6ai-labs/dgm-eval, reviews at https://openreview.net/forum?id=08zf7kTOoh

We systematically study a wide variety of generative models spanning semantically diverse image datasets to understand and improve the feature extractors and metrics used to evaluate them. Using best practices in psychophysics, we measure human perception of image realism for generated samples by conducting the largest experiment evaluating generative models to date, and find that no existing metric strongly correlates with human evaluations. Comparing against 17 modern metrics for evaluating the overall performance, fidelity, diversity, rarity, and memorization of generative models, we find that the state-of-the-art perceptual realism of diffusion models as judged by humans is not reflected in commonly reported metrics such as FID. This discrepancy is not explained by diversity in generated samples, though one cause is over-reliance on Inception-V3. We address these flaws through a study of alternative self-supervised feature extractors, find that the semantic information encoded by individual networks strongly depends on their training procedure, and show that DINOv2-ViT-L/14 allows for much richer evaluation of generative models. Next, we investigate data memorization, and find that generative models do memorize training examples on simple, smaller datasets like CIFAR10, but not necessarily on more complex datasets like ImageNet. However, our experiments show that current metrics do not properly detect memorization: no metric in the literature is able to separate memorization from other phenomena such as underfitting or mode shrinkage. To facilitate further development of generative models and their evaluation, we release all generated image datasets, human evaluation data, and a modular library to compute 17 common metrics for 9 different encoders at https://github.com/layer6ai-labs/dgm-eval.
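For readers unfamiliar with how the choice of feature extractor enters a metric such as FID, the following is a minimal sketch of the Fréchet distance between Gaussians fitted to two sets of feature vectors. It assumes the features have already been extracted by some encoder (Inception-V3 for standard FID, or an alternative such as DINOv2). The function name `frechet_distance` and the array layout (one row per image) are illustrative assumptions; this is not the API of the dgm-eval library.

```python
import numpy as np


def frechet_distance(feats_real, feats_gen):
    """Frechet distance between Gaussians fitted to two feature sets.

    Both inputs are arrays of shape (n_samples, n_features). Which encoder
    produced the features is outside the scope of this sketch.
    """
    mu_r = feats_real.mean(axis=0)
    mu_g = feats_gen.mean(axis=0)
    cov_r = np.atleast_2d(np.cov(feats_real, rowvar=False))
    cov_g = np.atleast_2d(np.cov(feats_gen, rowvar=False))

    # Tr((cov_r cov_g)^(1/2)) equals the sum of the square roots of the
    # eigenvalues of cov_r @ cov_g, which are real and non-negative for
    # positive semi-definite inputs. Clipping guards against tiny negative
    # values caused by floating-point error.
    eigvals = np.linalg.eigvals(cov_r @ cov_g)
    tr_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()

    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r) + np.trace(cov_g) - 2.0 * tr_sqrt)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    real = rng.normal(size=(500, 8))            # stand-in for real-image features
    fake = rng.normal(loc=0.5, size=(500, 8))   # stand-in for generated-image features
    print(frechet_distance(real, fake))
```

Because the distance depends only on the means and covariances of the features, swapping the encoder changes what the metric is sensitive to without changing the formula, which is why the abstract treats the feature extractor as a separate design choice.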

