The evaluation of machine-generated image captions poses an interesting yet persistent challenge. Effective evaluation measures must consider numerous dimensions of similarity, including semantic relevance, visual structure, object interactions, caption diversity, and specificity. Existing highly engineered measures attempt to capture specific aspects, but fall short of providing a holistic score that aligns closely with human judgments. Here, we propose CLAIR, a novel method that leverages the zero-shot language modeling capabilities of large language models (LLMs) to evaluate candidate captions. In our evaluations, CLAIR demonstrates a stronger correlation with human judgments of caption quality than existing measures. Notably, on Flickr8K-Expert, CLAIR achieves relative correlation improvements of 39.6% over SPICE and of 18.3% over image-augmented methods such as RefCLIP-S. Moreover, CLAIR provides noisily interpretable results by allowing the language model to identify the underlying reasoning behind its assigned score. Code is available at https://davidmchan.github.io/clair/
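To make the idea of LLM-based caption scoring more concrete, the following is a minimal sketch of how a zero-shot prompt might ask a language model for a score together with a short justification. It is not the released CLAIR implementation: the prompt wording, the 0 to 100 scale, the JSON reply format, and the names `build_prompt`, `parse_response`, `llm_caption_score`, and `fake_llm` are all assumptions made for illustration, and the model is represented by any callable that maps a prompt string to a reply string.

```python
import json
import re
from typing import Callable, List, Tuple


def build_prompt(candidate: str, references: List[str]) -> str:
    """Assemble a zero-shot prompt (hypothetical wording, not the paper's)."""
    refs = "\n".join(f"- {r}" for r in references)
    return (
        "You are judging whether a candidate caption describes the same "
        "image as a set of reference captions.\n"
        f"Candidate caption:\n- {candidate}\n"
        f"Reference captions:\n{refs}\n"
        "Reply with JSON containing an integer 'score' from 0 to 100 and a "
        "short 'reason' string."
    )


def parse_response(text: str) -> Tuple[float, str]:
    """Extract the first JSON object from the reply; return (score in [0, 1], reason)."""
    match = re.search(r"\{.*\}", text, flags=re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in model reply")
    data = json.loads(match.group(0))
    # Assumed 0-100 scale, rescaled and clamped to [0, 1].
    score = min(max(float(data["score"]) / 100.0, 0.0), 1.0)
    return score, str(data.get("reason", ""))


def llm_caption_score(
    candidate: str, references: List[str], llm: Callable[[str], str]
) -> Tuple[float, str]:
    """Score one candidate caption using any prompt -> reply callable."""
    return parse_response(llm(build_prompt(candidate, references)))


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model call so the sketch runs offline."""
    return 'Sure. {"score": 85, "reason": "Both describe a dog catching a frisbee."}'


if __name__ == "__main__":
    score, reason = llm_caption_score(
        "A dog catches a frisbee.",
        ["A dog leaps for a frisbee in a park."],
        fake_llm,
    )
    print(score, reason)
```

In practice `fake_llm` would be replaced by a call to an actual LLM; the returned reason string is what makes the score inspectable, although it is only as reliable as the model that produced it.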