Although thousands of researchers, engineers, and artists are actively working to improve text-to-image generation models, these systems often fail to produce images that accurately align with their text inputs. We introduce TIFA (Text-to-Image Faithfulness evaluation with question Answering), an automatic evaluation metric that measures the faithfulness of a generated image to its text input via visual question answering (VQA). Specifically, given a text input, we automatically generate several question-answer pairs using a language model. We then calculate image faithfulness by checking whether existing VQA models can answer these questions correctly using the generated image. TIFA is a reference-free metric that allows for fine-grained and interpretable evaluations of generated images, and it correlates better with human judgments than existing metrics. Based on this approach, we introduce TIFA v1.0, a benchmark consisting of 4K diverse text inputs and 25K questions across 12 categories (object, counting, etc.). We present a comprehensive evaluation of existing text-to-image models using TIFA v1.0 and highlight the limitations and challenges of current models. For instance, we find that current text-to-image models, despite doing well on color and material, still struggle with counting, spatial relations, and composing multiple objects. We hope our benchmark will help measure research progress in text-to-image synthesis carefully and provide valuable insights for further research.
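The scoring step described above can be illustrated with a minimal sketch. It assumes that the faithfulness score is the fraction of generated questions that a VQA model answers correctly, and that answers are compared as normalized strings; the abstract does not spell out either detail. The names QAPair, faithfulness_score, per_category_accuracy, and toy_vqa are hypothetical, and toy_vqa is a lookup-table stand-in for a real VQA model.

```python
from __future__ import annotations

from collections import defaultdict
from dataclasses import dataclass
from typing import Any, Callable, Sequence


@dataclass(frozen=True)
class QAPair:
    """One question-answer pair derived from the text input (hypothetical type)."""
    question: str
    answer: str
    category: str  # e.g. "object", "counting", "color"


# A VQA model is modeled as any callable: (image, question) -> answer string.
VQAModel = Callable[[Any, str], str]


def _is_correct(image: Any, qa: QAPair, vqa: VQAModel) -> bool:
    # Assumption: exact match after trimming whitespace and lowercasing.
    return vqa(image, qa.question).strip().lower() == qa.answer.strip().lower()


def faithfulness_score(image: Any, qa_pairs: Sequence[QAPair], vqa: VQAModel) -> float:
    """Fraction of questions the VQA model answers correctly on the image."""
    if not qa_pairs:
        raise ValueError("at least one question-answer pair is required")
    correct = sum(_is_correct(image, qa, vqa) for qa in qa_pairs)
    return correct / len(qa_pairs)


def per_category_accuracy(
    image: Any, qa_pairs: Sequence[QAPair], vqa: VQAModel
) -> dict[str, float]:
    """Accuracy broken down by question category, for a fine-grained view."""
    hits: dict[str, int] = defaultdict(int)
    totals: dict[str, int] = defaultdict(int)
    for qa in qa_pairs:
        totals[qa.category] += 1
        hits[qa.category] += int(_is_correct(image, qa, vqa))
    return {category: hits[category] / totals[category] for category in totals}


def toy_vqa(image: Any, question: str) -> str:
    """Stand-in for a real VQA model: ignores the image and looks up an answer."""
    canned = {
        "Is there a dog?": "yes",
        "How many dogs are there?": "one",
        "What color is the dog?": "brown",
    }
    return canned.get(question, "unknown")


# Questions a language model might produce for "two brown dogs on a couch".
example_pairs = [
    QAPair("Is there a dog?", "yes", "object"),
    QAPair("How many dogs are there?", "two", "counting"),
    QAPair("What color is the dog?", "brown", "color"),
    QAPair("Is the dog on a couch?", "yes", "spatial"),
]

print(faithfulness_score(None, example_pairs, toy_vqa))
print(per_category_accuracy(None, example_pairs, toy_vqa))
```

In this toy run the stand-in model gets the object and color questions right and misses the counting and spatial ones, which is the kind of per-category breakdown that makes the metric interpretable. A real setup would replace toy_vqa with an actual VQA model and pass the generated image instead of None.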