Effectively aligning with human judgment when evaluating machine-generated image captions is a complex yet intriguing challenge. Existing evaluation metrics like CIDEr or CLIP-Score fall short in this regard, as they either do not take the corresponding image into account or lack the capability to encode fine-grained details and penalize hallucinations. To overcome these issues, in this paper, we propose BRIDGE, a new learnable and reference-free image captioning metric that employs a novel module to map visual features into dense vectors and integrates them into multimodal pseudo-captions built during the evaluation process. This approach results in a multimodal metric that properly incorporates information from the input image without relying on reference captions, bridging the gap between human judgment and machine-generated image captions. Experiments spanning several datasets demonstrate that our proposal achieves state-of-the-art results compared to existing reference-free evaluation scores. Our source code and trained models are publicly available at: https://github.com/aimagelab/bridge-score.
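To make the high-level idea concrete, below is a minimal PyTorch sketch of how visual features might be mapped into dense pseudo-token embeddings and spliced into a templated "pseudo-caption" that is then scored against a candidate caption. This is an illustrative assumption of the mechanism, not the actual BRIDGE implementation: the class name `PseudoCaptionScorer`, the mapper architecture, and all dimensions are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class PseudoCaptionScorer(nn.Module):
    """Conceptual sketch of a BRIDGE-style reference-free metric.

    Visual features are mapped to dense pseudo-token embeddings,
    prepended to a templated caption embedding to form a multimodal
    pseudo-caption, and compared with the candidate caption.
    All names and dimensions are illustrative, not the paper's.
    """

    def __init__(self, visual_dim=768, embed_dim=512, num_pseudo_tokens=4):
        super().__init__()
        # Hypothetical mapping module: pooled visual features -> a small
        # set of dense vectors living in the text embedding space.
        self.mapper = nn.Sequential(
            nn.Linear(visual_dim, embed_dim * num_pseudo_tokens),
            nn.GELU(),
        )
        self.num_pseudo_tokens = num_pseudo_tokens
        self.embed_dim = embed_dim

    def forward(self, visual_feats, template_embeds, caption_embed):
        # visual_feats:    (B, visual_dim) pooled image features
        # template_embeds: (B, T, embed_dim) embeddings of a text template
        # caption_embed:   (B, embed_dim) embedding of the candidate caption
        batch = visual_feats.size(0)
        pseudo = self.mapper(visual_feats).view(
            batch, self.num_pseudo_tokens, self.embed_dim
        )
        # Build the multimodal pseudo-caption: dense visual pseudo-tokens
        # followed by the template's token embeddings.
        pseudo_caption = torch.cat([pseudo, template_embeds], dim=1)
        # Pool the pseudo-caption and score the candidate caption via
        # cosine similarity (a stand-in for the learned scoring head).
        pooled = pseudo_caption.mean(dim=1)
        return F.cosine_similarity(pooled, caption_embed, dim=-1)


# Toy usage with random tensors in place of real encoder outputs.
scorer = PseudoCaptionScorer()
score = scorer(torch.randn(2, 768), torch.randn(2, 10, 512), torch.randn(2, 512))
print(score.shape)  # torch.Size([2]) -- one score per image-caption pair
```

In practice, the visual and textual embeddings would come from a pretrained dual encoder (e.g., a CLIP-like model), and the mapping module would be trained so that the resulting score correlates with human judgment.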