The CLIP model has recently proven very effective for a variety of cross-modal tasks, including the evaluation of captions generated by vision-and-language architectures. In this paper, we propose a new recipe for a contrastive-based evaluation metric for image captioning, namely Positive-Augmented Contrastive learning Score (PAC-S), that, in a novel way, unifies the learning of a contrastive visual-semantic space with the addition of generated images and text on curated data. Experiments spanning several datasets demonstrate that our new metric achieves the highest correlation with human judgments on both images and videos, outperforming existing reference-based metrics like CIDEr and SPICE and reference-free metrics like CLIP-Score. Finally, we test the system-level correlation of the proposed metric when considering popular image captioning approaches, and assess the impact of employing different cross-modal features. Our source code and trained models are publicly available at: https://github.com/aimagelab/pacscore.
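To make the notion of a reference-free, contrastive-based caption score concrete, the following is a minimal sketch of how such a score can be computed from an image embedding and a caption embedding: a scaled cosine similarity clipped at zero. It is an illustration under stated assumptions, not the released PAC-S implementation. The function names, the toy embedding vectors, and the default scaling factor `w = 2.5` (the value commonly associated with CLIP-Score) are assumptions made here; in practice the embeddings would come from the visual and textual encoders of a CLIP-style model.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("embeddings must have the same dimensionality")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_u * norm_v)


def reference_free_score(image_emb, caption_emb, w=2.5):
    """Hypothetical helper: scaled, non-negative cosine similarity.

    `w` is an assumed scaling factor; negative similarities are clipped
    to zero so the score stays in the range [0, w].
    """
    return w * max(cosine_similarity(image_emb, caption_emb), 0.0)


# Toy embeddings standing in for encoder outputs (assumed values).
image_emb = [0.2, 0.9, 0.1]
good_caption_emb = [0.25, 0.85, 0.05]
bad_caption_emb = [0.9, -0.3, 0.4]

good = reference_free_score(image_emb, good_caption_emb)
bad = reference_free_score(image_emb, bad_caption_emb)
```

Under this sketch, a caption whose embedding points in nearly the same direction as the image embedding receives a score close to `w`, while an unrelated or opposed caption receives a score near zero. The contribution described in the abstract lies in how the embedding space is trained, which this sketch does not attempt to reproduce.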