As the use of text-to-image generative models grows, so does the adoption of automatic benchmarking methods for their evaluation. However, while metrics and datasets abound, few unified benchmarking libraries provide a framework for performing evaluations across many datasets and metrics. Furthermore, the rapid introduction of increasingly robust benchmarking methods requires that evaluation libraries remain flexible to new datasets and metrics. Finally, there remains a gap in synthesizing evaluations into actionable takeaways about model performance. To enable unified, flexible, and actionable evaluations, we introduce EvalGIM (pronounced "EvalGym"), a library for evaluating generative image models. EvalGIM provides broad support for datasets and metrics used to measure the quality, diversity, and consistency of text-to-image generative models. In addition, EvalGIM treats flexibility for user customization as a top priority: its structure allows plug-and-play additions of new datasets and metrics. To deliver actionable evaluation insights, we introduce "Evaluation Exercises" that highlight takeaways for specific evaluation questions. The Evaluation Exercises contain easy-to-use and reproducible implementations of two state-of-the-art evaluation methods for text-to-image generative models: consistency-diversity-realism Pareto Fronts and disaggregated measurements of performance disparities across groups. EvalGIM also contains Evaluation Exercises that introduce two new analysis methods for text-to-image generative models: robustness analyses of model rankings and balanced evaluations across different prompt styles. We encourage text-to-image model exploration with EvalGIM and invite contributions at https://github.com/facebookresearch/EvalGIM/.
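To make the plug-and-play design concrete, the following is a minimal, self-contained sketch of a metric registry in the style the abstract describes. All names in it (`MetricFn`, `register_metric`, `evaluate`) are hypothetical illustrations of the pattern, not EvalGIM's actual API; see the repository for the real interfaces.

```python
# Illustrative sketch of a plug-and-play metric registry, the design pattern
# the abstract describes. Every name here is hypothetical and does NOT
# reflect EvalGIM's actual API.
from typing import Callable, Dict, List

# A metric maps (real_images, generated_images, prompts) -> score.
MetricFn = Callable[[List[str], List[str], List[str]], float]

METRICS: Dict[str, MetricFn] = {}


def register_metric(name: str):
    """Decorator so users can add new metrics without touching library code."""
    def wrap(fn: MetricFn) -> MetricFn:
        METRICS[name] = fn
        return fn
    return wrap


@register_metric("dummy_consistency")
def dummy_consistency(real: List[str], generated: List[str],
                      prompts: List[str]) -> float:
    # Placeholder scoring: fraction of prompts with a non-empty generation.
    return sum(1 for g in generated if g) / max(len(prompts), 1)


def evaluate(metric_names: List[str], real: List[str],
             generated: List[str], prompts: List[str]) -> Dict[str, float]:
    """Run every requested metric by looking it up in the registry."""
    return {name: METRICS[name](real, generated, prompts)
            for name in metric_names}


if __name__ == "__main__":
    scores = evaluate(["dummy_consistency"], ["r1"], ["g1"], ["a cat"])
    print(scores)  # {'dummy_consistency': 1.0}
```

Registering metrics (and, analogously, datasets) through a named registry is one common way to achieve the plug-and-play extensibility the abstract claims, since new components can be contributed without modifying the evaluation loop.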