As the use of text-to-image generative models grows, so does the adoption of automatic benchmarking methods for their evaluation. However, while metrics and datasets abound, few unified benchmarking libraries provide a framework for performing evaluations across many datasets and metrics. Furthermore, the rapid introduction of increasingly robust benchmarking methods requires that evaluation libraries remain flexible to new datasets and metrics. Finally, there remains a gap in synthesizing evaluations into actionable takeaways about model performance. To enable unified, flexible, and actionable evaluations, we introduce EvalGIM (pronounced "EvalGym"), a library for evaluating generative image models. EvalGIM provides broad support for datasets and metrics used to measure the quality, diversity, and consistency of text-to-image generative models. In addition, EvalGIM treats flexibility for user customization as a top priority: its structure allows plug-and-play additions of new datasets and metrics. To deliver actionable evaluation insights, we introduce "Evaluation Exercises" that highlight takeaways for specific evaluation questions. The Evaluation Exercises contain easy-to-use and reproducible implementations of two state-of-the-art evaluation methods for text-to-image generative models: consistency-diversity-realism Pareto Fronts and disaggregated measurements of performance disparities across groups. EvalGIM also contains Evaluation Exercises that introduce two new analysis methods for text-to-image generative models: robustness analyses of model rankings and balanced evaluations across different prompt styles. We encourage text-to-image model exploration with EvalGIM and invite contributions at https://github.com/facebookresearch/EvalGIM/.
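To make the plug-and-play design concrete, the following is a minimal, self-contained sketch of a metric registry in the style the abstract describes. All names in it (`MetricFn`, `register_metric`, `evaluate`) are hypothetical illustrations of the pattern, not EvalGIM's actual API; see the repository for the real interfaces.

```python
# Illustrative sketch of a plug-and-play metric registry, the design pattern
# the abstract describes. Every name here is hypothetical and does NOT
# reflect EvalGIM's actual API.
from typing import Callable, Dict, List

# A metric maps (real_images, generated_images, prompts) -> score.
MetricFn = Callable[[List[str], List[str], List[str]], float]

METRICS: Dict[str, MetricFn] = {}


def register_metric(name: str):
    """Decorator so users can add new metrics without touching library code."""
    def wrap(fn: MetricFn) -> MetricFn:
        METRICS[name] = fn
        return fn
    return wrap


@register_metric("dummy_consistency")
def dummy_consistency(real: List[str], generated: List[str],
                      prompts: List[str]) -> float:
    # Placeholder scoring: fraction of prompts with a non-empty generation.
    return sum(1 for g in generated if g) / max(len(prompts), 1)


def evaluate(metric_names: List[str], real: List[str],
             generated: List[str], prompts: List[str]) -> Dict[str, float]:
    """Run every requested metric by looking it up in the registry."""
    return {name: METRICS[name](real, generated, prompts)
            for name in metric_names}


if __name__ == "__main__":
    scores = evaluate(["dummy_consistency"], ["r1"], ["g1"], ["a cat"])
    print(scores)  # {'dummy_consistency': 1.0}
```

Registering metrics (and, analogously, datasets) through a named registry is one common way to achieve the plug-and-play extensibility the abstract claims, since new components can be contributed without modifying the evaluation loop.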