Evaluating the quality of automatically generated image descriptions is a complex task that requires metrics capturing diverse dimensions such as grammaticality, coverage, accuracy, and truthfulness. Although human evaluation provides valuable insights, its cost and time-consuming nature pose limitations. Existing automated metrics such as BLEU, ROUGE, METEOR, and CIDEr attempt to fill this gap, but they often exhibit weak correlations with human judgment. To address this challenge, we propose a novel evaluation framework, Image2Text2Image, which leverages diffusion models such as Stable Diffusion or DALL-E for text-to-image generation. In Image2Text2Image, an input image is first processed by the image captioning model under evaluation to generate a textual description. A diffusion model then uses this description to create a new image. By comparing features extracted from the original and generated images, we measure their similarity with a designated similarity metric. A high similarity score suggests that the model has produced a faithful textual description, while a low score reveals discrepancies and potential weaknesses in the model's performance. Notably, our framework does not rely on human-annotated reference captions, making it a valuable tool for assessing image captioning models. Extensive experiments and human evaluations validate the efficacy of the proposed framework. The code and dataset will be released to support further research in the community.
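The final comparison step described above can be sketched as follows. This is a minimal illustration only: the abstract does not specify the feature extractor or the similarity metric, so the flatten-and-normalise extractor here is a hypothetical stand-in (in practice one might use, e.g., a CLIP image encoder), and cosine similarity is one plausible choice of metric.

```python
import numpy as np

def extract_features(image: np.ndarray) -> np.ndarray:
    """Placeholder feature extractor (stand-in for e.g. a CLIP image
    encoder): flattens the image and L2-normalises the result."""
    v = image.astype(np.float64).ravel()
    return v / (np.linalg.norm(v) + 1e-12)

def image_similarity(original: np.ndarray, regenerated: np.ndarray) -> float:
    """Cosine similarity between the feature vectors of the original
    image and the image regenerated from the caption."""
    return float(np.dot(extract_features(original),
                        extract_features(regenerated)))

# Toy example: an identical image yields the maximum score, while a
# perturbed image (a proxy for an unfaithful caption) scores lower.
img = np.random.default_rng(0).random((8, 8, 3))
noisy = img + 0.5 * np.random.default_rng(1).random((8, 8, 3))
print(round(image_similarity(img, img), 4))   # identical pair -> 1.0
print(image_similarity(img, noisy) < 1.0)     # perturbed pair scores lower
```

A higher score is read as evidence that the caption preserved the visual content of the original image; the thresholds and the choice of encoder would be design decisions of the evaluation setup.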