The CLIP model has recently proven very effective for a variety of cross-modal tasks, including the evaluation of captions generated by vision-and-language architectures. In this paper, we propose a new recipe for a contrastive-based evaluation metric for image captioning, namely Positive-Augmented Contrastive learning Score (PAC-S), which in a novel way unifies the learning of a contrastive visual-semantic space with the addition of generated images and text on curated data. Experiments spanning several datasets demonstrate that our new metric achieves the highest correlation with human judgments on both images and videos, outperforming existing reference-based metrics like CIDEr and SPICE and reference-free metrics like CLIP-Score. Finally, we test the system-level correlation of the proposed metric when considering popular image captioning approaches, and assess the impact of employing different cross-modal features. Our source code and trained models are publicly available at: https://github.com/aimagelab/pacscore.
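As a rough illustration of what a reference-free, contrastive-based caption score computes, the sketch below scores an image-caption pair by the rescaled, clipped cosine similarity of their embeddings. This is a minimal sketch and not the released PAC-S implementation: the function names, the plain-list embeddings, and the default scaling factor `w=2.5` (the value commonly associated with CLIP-Score) are assumptions made for illustration. In practice the embeddings would come from the visual and textual encoders of a CLIP-style model.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def reference_free_score(
    image_emb: Sequence[float],
    caption_emb: Sequence[float],
    w: float = 2.5,  # assumed rescaling factor; not stated in this abstract
) -> float:
    """Hypothetical helper: w * max(cos(image, caption), 0)."""
    return w * max(cosine_similarity(image_emb, caption_emb), 0.0)


# Toy embeddings standing in for encoder outputs.
aligned = reference_free_score([1.0, 0.0], [1.0, 0.0])
unrelated = reference_free_score([1.0, 0.0], [0.0, 1.0])
opposed = reference_free_score([1.0, 0.0], [-1.0, 0.0])
```

Clipping at zero keeps the score non-negative, so a caption whose embedding points away from the image embedding is treated the same as an unrelated one.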