Evaluating the quality of automatically generated image descriptions is challenging: it requires metrics that capture diverse aspects such as grammaticality, coverage, correctness, and truthfulness. While human evaluation offers valuable insights, it is costly and time-consuming. Automated metrics such as BLEU, ROUGE, METEOR, and CIDEr aim to bridge this gap but often correlate weakly with human judgment. We address this challenge by introducing a novel evaluation framework built on a modern large language model (LLM) capable of image generation, such as GPT-4 or Gemini. The framework first feeds an input image to the image captioning model under evaluation, which produces a textual description. The LLM then generates a new image from that description. We extract features from both the original image and the LLM-generated image and measure their similarity with a designated similarity metric. A high similarity score suggests that the captioning model produced an accurate description; a low score indicates discrepancies and reveals potential shortcomings in the model. The framework requires no human-annotated reference captions, making it a practical tool for assessing image captioning models, and its efficacy is confirmed through human evaluation.
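A minimal sketch of how such an evaluation loop could be implemented, under assumptions not specified in the abstract: CLIP is used here as one possible feature extractor and cosine similarity as the designated metric, and `caption_model` and `generate_image_from_text` are hypothetical placeholders for the captioning model under test and the image-capable LLM, respectively.

```python
# Sketch of the reference-free evaluation loop described above.
# Assumptions: CLIP embeddings + cosine similarity (one plausible choice);
# `caption_model` and `generate_image_from_text` are hypothetical callables.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def image_embedding(image: Image.Image) -> torch.Tensor:
    """Embed an image with CLIP and L2-normalize the feature vector."""
    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        feats = clip.get_image_features(**inputs)
    return feats / feats.norm(dim=-1, keepdim=True)

def evaluate_caption_model(original: Image.Image,
                           caption_model,
                           generate_image_from_text) -> float:
    """Score one image: caption it, regenerate an image from the caption,
    and return the cosine similarity between the two image embeddings.
    Higher scores suggest a more faithful caption."""
    caption = caption_model(original)                # model under evaluation
    regenerated = generate_image_from_text(caption)  # image-capable LLM
    return (image_embedding(original) @ image_embedding(regenerated).T).item()
```

Because both embeddings are L2-normalized, the dot product equals cosine similarity; averaging this score over a test set would give a per-model quality estimate without any reference captions.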