Revisiting Text-to-Image Evaluation with Gecko: On Metrics, Prompts, and Human Ratings

Olivia Wiles, Chuhan Zhang, Isabela Albuquerque, Ivana Kajić, Su Wang, Emanuele Bugliarello, Yasumasa Onoe, Chris Knutsen, Cyrus Rashtchian, Jordi Pont-Tuset, Aida Nematzadeh

Source: arXiv. Data and code will be released at: https://github.com/google-deepmind/gecko_benchmark_t2i

Although text-to-image (T2I) generative models have become ubiquitous, they do not necessarily generate images that align with a given prompt. Previous work has evaluated T2I alignment by proposing metrics, benchmarks, and templates for collecting human judgements, but the quality of these components has not been systematically measured. Human-rated prompt sets are generally small, and the reliability of the ratings (and thereby of the prompt set used to compare models) is not evaluated. We address this gap by performing an extensive study evaluating auto-eval metrics and human templates. We provide three main contributions: (1) We introduce a comprehensive skills-based benchmark that can discriminate models across different human templates. This skills-based benchmark categorises prompts into sub-skills, allowing a practitioner to pinpoint not only which skills are challenging, but at what level of complexity a skill becomes challenging. (2) We gather human ratings across four templates and four T2I models for a total of >100K annotations. This allows us to understand where differences arise due to inherent ambiguity in the prompt and where they arise due to differences in metric and model quality. (3) Finally, we introduce a new QA-based auto-eval metric that is better correlated with human ratings than existing metrics for our new dataset, across different human templates, and on TIFA160.
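
To make the idea of comparing a QA-based metric against human ratings concrete, here is a minimal sketch. It is not the paper's method: the abstract does not say how the metric scores answers or which correlation statistic is reported, so exact-match scoring and Spearman rank correlation are assumptions made here for illustration. All function names and the toy data are hypothetical.

```python
from statistics import mean


def qa_alignment_score(expected, predicted):
    """Fraction of questions answered as expected (assumed exact-match scoring)."""
    if not expected or len(expected) != len(predicted):
        raise ValueError("need equally sized, non-empty answer lists")
    hits = sum(e.strip().lower() == p.strip().lower()
               for e, p in zip(expected, predicted))
    return hits / len(expected)


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equally sized lists with at least 2 items")
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


# Hypothetical toy data: one prompt, four generated images.
# Expected answers come from questions derived from the prompt;
# predicted answers would come from a VQA model run on each image.
expected = ["yes", "red", "two"]
vqa_answers = [
    ["yes", "red", "two"],
    ["yes", "blue", "two"],
    ["no", "blue", "two"],
    ["no", "blue", "three"],
]
metric_scores = [qa_alignment_score(expected, a) for a in vqa_answers]
human_ratings = [4.5, 4.0, 2.0, 2.5]  # made-up mean ratings per image

rho = spearman(metric_scores, human_ratings)
```

A higher `rho` means the metric orders the images more like the human raters do. A rank-based statistic is used in this sketch because it ignores differences in scale between the metric and the rating template.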
