Automatic evaluation of image captions is critical for benchmarking and advancing image captioning research. Existing metrics provide only a single score to measure caption quality, which makes them hard to interpret and uninformative. Humans, in contrast, can easily identify specific problems in a caption, e.g., which words are inaccurate and which salient objects are not described, and then rate its quality. To support such informative feedback, we propose an Informative Metric for Reference-free Image Caption evaluation (InfoMetIC). Given an image and a caption, InfoMetIC reports incorrect words and unmentioned image regions at the fine-grained level, and provides a text precision score, a vision recall score, and an overall quality score at the coarse-grained level. The coarse-grained score of InfoMetIC achieves significantly better correlation with human judgements than existing metrics on multiple benchmarks. We also construct a token-level evaluation dataset and demonstrate the effectiveness of InfoMetIC in fine-grained evaluation. Our code and datasets are publicly available at https://github.com/HAWLYQ/InfoMetIC.
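To make the two levels of output concrete, the following is a minimal sketch of the kind of report described above. It is not the InfoMetIC implementation. The names `CaptionReport` and `build_report`, the 0.5 threshold, the use of a plain mean for the precision and recall scores, and the harmonic mean for the overall score are all assumptions made for illustration; the abstract does not specify how the scores are computed. The per-token and per-region scores are assumed to come from some upstream model.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CaptionReport:
    """Hypothetical container for fine- and coarse-grained outputs."""
    incorrect_words: List[str]        # fine-grained: tokens flagged as wrong
    unmentioned_regions: List[int]    # fine-grained: indices of regions not described
    text_precision: float             # coarse-grained
    vision_recall: float              # coarse-grained
    overall: float                    # coarse-grained


def build_report(
    tokens: List[str],
    token_scores: List[float],
    region_scores: List[float],
    threshold: float = 0.5,
) -> CaptionReport:
    """Turn per-token and per-region scores in [0, 1] into a report.

    Assumption: a token score is the estimated probability that the word is
    correct for the image, and a region score is the estimated probability
    that the region is mentioned by the caption.
    """
    if len(tokens) != len(token_scores):
        raise ValueError("tokens and token_scores must have the same length")
    if not tokens or not region_scores:
        raise ValueError("need at least one token and one region")

    # Fine-grained level: flag everything that falls below the threshold.
    incorrect = [t for t, s in zip(tokens, token_scores) if s < threshold]
    unmentioned = [i for i, s in enumerate(region_scores) if s < threshold]

    # Coarse-grained level: simple means (an illustrative choice).
    precision = sum(token_scores) / len(token_scores)
    recall = sum(region_scores) / len(region_scores)

    # Overall score: harmonic mean of the two (also an illustrative choice).
    denom = precision + recall
    overall = 2 * precision * recall / denom if denom > 0 else 0.0

    return CaptionReport(incorrect, unmentioned, precision, recall, overall)


report = build_report(
    tokens=["a", "dog", "on", "grass"],
    token_scores=[0.9, 0.2, 0.8, 0.9],
    region_scores=[0.9, 0.1],
)
print(report)
```

In this toy input the word "dog" and the second region fall below the threshold, so they appear in the fine-grained lists, while the three coarse-grained numbers summarise the same scores in aggregate.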