Existing automatic evaluation metrics for text-to-image synthesis provide only an image-text matching score and do not consider object-level compositionality, which results in poor correlation with human judgments. In this work, we propose LLMScore, a new framework that offers evaluation scores with multi-granularity compositionality. LLMScore leverages large language models (LLMs) to evaluate text-to-image models. It first transforms the image into image-level and object-level visual descriptions. An evaluation instruction is then fed into the LLM to measure the alignment between the synthesized image and the text, ultimately generating a score accompanied by a rationale. Our extensive analysis shows that LLMScore achieves the highest correlation with human judgments across a wide range of datasets (Attribute Binding Contrast, Concept Conjunction, MSCOCO, DrawBench, PaintSkills). Notably, LLMScore achieves a Kendall's tau correlation with human evaluations that is 58.8% and 31.2% higher than that of the commonly used text-image matching metrics CLIP and BLIP, respectively.
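To make the reported comparison concrete, the following is a minimal sketch of how a Kendall's tau correlation between metric scores and human ratings might be computed, and how a "percent higher" figure could be derived from two such correlations. It assumes the tie-free tau-a variant and assumes the percentages are relative gains over the baseline correlation; the abstract specifies neither. The function names and all scores are hypothetical toy values, not results from this work.

```python
from itertools import combinations


def kendall_tau_a(xs, ys):
    """Kendall's tau-a: (concordant - discordant) / total number of pairs.

    Tied pairs count as neither concordant nor discordant.
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        sign = (x1 - x2) * (y1 - y2)
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(xs) * (len(xs) - 1) // 2
    return (concordant - discordant) / n_pairs


def relative_gain_percent(new, baseline):
    """Relative improvement of `new` over `baseline`, in percent (assumed reading)."""
    return (new - baseline) / baseline * 100.0


# Toy, made-up ratings for five images (illustrative only).
human = [1, 2, 3, 4, 5]
metric_a = [1, 3, 2, 4, 5]  # hypothetical metric with one swapped pair
metric_b = [2, 1, 4, 3, 5]  # hypothetical metric with two swapped pairs

tau_a = kendall_tau_a(human, metric_a)
tau_b = kendall_tau_a(human, metric_b)
print(f"tau(metric_a) = {tau_a:.2f}, tau(metric_b) = {tau_b:.2f}")
print(f"relative gain = {relative_gain_percent(tau_a, tau_b):.1f}%")
```

With real rating data that contains ties, a tie-corrected variant such as tau-b would likely be a better choice than the tau-a used here.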