TIFA: Accurate and Interpretable Text-to-Image Faithfulness Evaluation with Question Answering

Despite thousands of researchers, engineers, and artists actively working on improving text-to-image generation models, systems often fail to produce images that accurately align with the text inputs. We introduce TIFA (Text-to-Image Faithfulness evaluation with question Answering), an automatic evaluation metric that measures the faithfulness of a generated image to its text input via visual question answering (VQA). Specifically, given a text input, we automatically generate several question-answer pairs using a language model. We calculate image faithfulness by checking whether existing VQA models can answer these questions using the generated image. TIFA is a reference-free metric that allows for fine-grained and interpretable evaluations of generated images. TIFA also has better correlations with human judgments than existing metrics. Based on this approach, we introduce TIFA v1.0, a benchmark consisting of 4K diverse text inputs and 25K questions across 12 categories (object, counting, etc.). We present a comprehensive evaluation of existing text-to-image models using TIFA v1.0 and highlight the limitations and challenges of current models. For instance, we find that current text-to-image models, despite doing well on color and material, still struggle in counting, spatial relations, and composing multiple objects. We hope our benchmark will help carefully measure the research progress in text-to-image synthesis and provide valuable insights for further research.
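The scoring step described above can be sketched as the fraction of generated questions that a VQA model answers correctly on the generated image. This is a minimal illustration, not the authors' implementation: the `QAPair` fields, the `faithfulness_score` and `score_by_category` helpers, the multiple-choice format, and the `stub_vqa` stand-in (which returns canned answers instead of looking at an image) are all assumptions made for this example. The question-generation step with a language model is not shown.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Any, Callable, Dict, Sequence, Tuple


@dataclass(frozen=True)
class QAPair:
    """One question generated from the text input (hypothetical schema)."""
    question: str
    choices: Tuple[str, ...]
    answer: str      # expected answer, derived from the text input
    category: str    # e.g. "object", "counting"


# Assumed interface: a VQA model maps (image, question, choices) to an answer.
VQAModel = Callable[[Any, str, Tuple[str, ...]], str]


def _normalize(text: str) -> str:
    return text.strip().lower()


def faithfulness_score(image: Any, qa_pairs: Sequence[QAPair], vqa: VQAModel) -> float:
    """Fraction of questions the VQA model answers correctly on the image."""
    if not qa_pairs:
        raise ValueError("at least one question-answer pair is required")
    correct = sum(
        _normalize(vqa(image, qa.question, qa.choices)) == _normalize(qa.answer)
        for qa in qa_pairs
    )
    return correct / len(qa_pairs)


def score_by_category(image: Any, qa_pairs: Sequence[QAPair], vqa: VQAModel) -> Dict[str, float]:
    """Per-category accuracy, which is what makes the score interpretable."""
    totals: Dict[str, int] = defaultdict(int)
    hits: Dict[str, int] = defaultdict(int)
    for qa in qa_pairs:
        totals[qa.category] += 1
        if _normalize(vqa(image, qa.question, qa.choices)) == _normalize(qa.answer):
            hits[qa.category] += 1
    return {category: hits[category] / totals[category] for category in totals}


# Illustrative questions for a text input such as "two dogs next to a red ball".
example_pairs = [
    QAPair("Is there a dog in the image?", ("yes", "no"), "yes", "object"),
    QAPair("How many dogs are there?", ("1", "2", "3", "4"), "2", "counting"),
    QAPair("What color is the ball?", ("red", "blue", "green", "yellow"), "red", "color"),
    QAPair("Is the ball next to the dogs?", ("yes", "no"), "yes", "spatial"),
]


def stub_vqa(image: Any, question: str, choices: Tuple[str, ...]) -> str:
    """Stand-in for a real VQA model; it miscounts the dogs on purpose."""
    canned = {
        "Is there a dog in the image?": "yes",
        "How many dogs are there?": "3",
        "What color is the ball?": "red",
        "Is the ball next to the dogs?": "yes",
    }
    return canned[question]


overall = faithfulness_score(None, example_pairs, stub_vqa)
breakdown = score_by_category(None, example_pairs, stub_vqa)
```

With the stub above, three of the four questions are answered correctly, so `overall` is 0.75 and the per-category breakdown isolates the failure to the counting question. In practice `stub_vqa` would be replaced by a call to an actual VQA model, and `image` by the generated image.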

