The CLIP model has recently proven very effective for a variety of cross-modal tasks, including the evaluation of captions generated by vision-and-language architectures. In this paper, we propose a new recipe for a contrastive-based evaluation metric for image captioning, namely Positive-Augmented Contrastive learning Score (PAC-S), which, in a novel way, unifies the learning of a contrastive visual-semantic space with the addition of generated images and text on curated data. Experiments spanning several datasets demonstrate that our new metric achieves the highest correlation with human judgments on both images and videos, outperforming existing reference-based metrics such as CIDEr and SPICE and reference-free metrics such as CLIP-Score. Finally, we test the system-level correlation of the proposed metric on popular image captioning approaches and assess the impact of employing different cross-modal features. Our source code and trained models are publicly available at: https://github.com/aimagelab/pacscore.
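To make the notion of a contrastive, reference-free caption metric concrete, the following is a minimal sketch of a CLIP-Score-style computation: a rescaled, clipped cosine similarity between an image embedding and a caption embedding. It is not the PAC-S implementation. The function names are hypothetical, the vectors are toy placeholders rather than outputs of a real encoder, and the scale `w = 2.5` is assumed from the commonly cited CLIP-Score formulation; the abstract does not state which scaling or encoders PAC-S uses.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimensionality")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def reference_free_score(
    image_emb: Sequence[float],
    caption_emb: Sequence[float],
    w: float = 2.5,  # assumed rescaling factor, not taken from the abstract
) -> float:
    """Hypothetical CLIP-Score-style metric: w * max(cos(image, caption), 0)."""
    return w * max(cosine_similarity(image_emb, caption_emb), 0.0)


if __name__ == "__main__":
    # Toy embeddings standing in for the outputs of an image and a text encoder.
    image = [0.2, 0.9, 0.1]
    good_caption = [0.25, 0.85, 0.05]
    bad_caption = [-0.9, 0.1, 0.3]
    print(reference_free_score(image, good_caption))
    print(reference_free_score(image, bad_caption))
```

In a metric of this kind, the quality of the score depends entirely on the embedding space. Per the abstract, that space is where PAC-S intervenes, by learning the contrastive visual-semantic space with additional generated images and text.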