In this work, we show that text-to-image generative models can be 'inverted' to assess their own text-image understanding capabilities in a completely automated manner. Our method, called SelfEval, uses the generative model to compute the likelihood of real images given text prompts, which makes the generative model directly applicable to discriminative tasks. Using SelfEval, we repurpose standard datasets created for evaluating multimodal text-image discriminative models to evaluate generative models in a fine-grained manner, assessing their performance on attribute binding, color recognition, counting, shape recognition, and spatial understanding. To the best of our knowledge, SelfEval is the first automated metric to show a high degree of agreement with gold-standard human evaluations when measuring text faithfulness across multiple models and benchmarks. Moreover, SelfEval enables us to evaluate generative models on challenging tasks such as the Winoground image-score, where they demonstrate performance competitive with discriminative models. We also show severe drawbacks of standard automated metrics such as CLIP-score for measuring text faithfulness on benchmarks such as DrawBench, and how SelfEval sidesteps these issues. We hope SelfEval enables easy and reliable automated evaluation for diffusion models.
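To illustrate how likelihoods of real images given text prompts can serve a discriminative task, the following is a minimal sketch, not the SelfEval implementation. It assumes that some generative model has already produced a log-likelihood estimate of each image under each candidate caption; how those estimates are obtained is outside the sketch. The function names `pick_caption` and `accuracy` and the numeric scores are hypothetical and chosen only for illustration.

```python
from typing import Sequence


def pick_caption(log_likelihoods: Sequence[float]) -> int:
    """Return the index of the caption under which the image is most likely.

    log_likelihoods[j] is assumed to be an estimate of log p(image | caption_j)
    supplied by a generative model (hypothetical input, not computed here).
    """
    return max(range(len(log_likelihoods)), key=lambda j: log_likelihoods[j])


def accuracy(score_rows: Sequence[Sequence[float]], labels: Sequence[int]) -> float:
    """Fraction of images whose highest-likelihood caption is the labelled one."""
    correct = sum(pick_caption(row) == y for row, y in zip(score_rows, labels))
    return correct / len(labels)


# Made-up scores for two images and three candidate captions each.
scores = [
    [-3.2, -1.1, -4.0],  # image 0: caption 1 has the highest likelihood
    [-0.5, -2.0, -2.5],  # image 1: caption 0 has the highest likelihood
]
labels = [1, 1]  # hypothetical ground-truth caption indices

print(accuracy(scores, labels))
```

Under this framing, the generative model acts as a classifier over captions, so a dataset built for discriminative text-image models can be scored with ordinary accuracy.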