Evaluating text-to-image (T2I) models is notoriously difficult. A strong recent approach for assessing text-image faithfulness is based on QG/A (question generation and answering): pre-trained foundation models automatically generate a set of questions and answers from the prompt, and an output image is scored by whether the answers that a visual question answering (VQA) model extracts from it are consistent with the prompt-based answers. This kind of evaluation naturally depends on the quality of the underlying QG and VQA models. We identify and address several reliability challenges in existing QG/A work: (a) QG questions should respect the prompt (avoiding hallucinations, duplications, and omissions) and (b) VQA answers should be consistent (not asserting that there is no motorcycle in an image while also claiming the motorcycle is blue). We address these issues with Davidsonian Scene Graph (DSG), an empirically grounded evaluation framework inspired by formal semantics, which is adaptable to any QG/A framework. DSG produces atomic and unique questions organized in dependency graphs, which (i) ensure appropriate semantic coverage and (ii) sidestep inconsistent answers. With extensive experimentation and human evaluation on a range of model configurations (LLM, VQA, and T2I), we empirically demonstrate that DSG addresses the challenges noted above. Finally, we present DSG-1k, an open-sourced evaluation benchmark that includes 1,060 prompts, covering a wide range of fine-grained semantic categories with a balanced distribution. We release the DSG-1k prompts and the corresponding DSG questions.
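To make the role of the dependency graph concrete, the following is a minimal sketch of dependency-aware scoring, not the released DSG implementation. The names `Question` and `score_with_dependencies` are hypothetical, the VQA model is replaced by a plain dictionary of yes/no answers, and the scoring rule (a question counts as correct only if all of its parent questions are also answered positively) is an assumption made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Question:
    """One atomic yes/no question (hypothetical structure for illustration)."""
    qid: int
    text: str
    parents: list[int] = field(default_factory=list)  # ids this question depends on


def score_with_dependencies(questions, vqa_answers):
    """Score an image from raw VQA answers while respecting dependencies.

    Assumption: a question is counted as correct only when the VQA answer
    is positive AND every parent question is (effectively) positive.
    Assumption: parents always have smaller ids than their children.
    """
    effective = {}
    for q in sorted(questions, key=lambda item: item.qid):
        parents_ok = all(effective[p] for p in q.parents)
        effective[q.qid] = bool(parents_ok and vqa_answers[q.qid])
    score = sum(effective.values()) / len(effective)
    return score, effective


# The motorcycle example from the text: the color question depends on
# the existence question.
questions = [
    Question(1, "Is there a motorcycle?"),
    Question(2, "Is the motorcycle blue?", parents=[1]),
    Question(3, "Is there a road?"),
]

# Stand-in for VQA output; answers 1 and 2 contradict each other.
raw_answers = {1: False, 2: True, 3: True}

score, effective = score_with_dependencies(questions, raw_answers)
print(effective)
print(score)
```

In this sketch the contradictory "yes, it is blue" answer is discarded because its parent question ("Is there a motorcycle?") was answered negatively, so the inconsistency never reaches the final score.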