Vulnerability to lexical perturbation is a critical weakness of automatic evaluation metrics for image captioning. This paper proposes Perturbation Robust Multi-Lingual CLIPScore (PR-MCS), a novel reference-free image captioning metric that is robust to such perturbations and applicable to multiple languages. To achieve perturbation robustness, we fine-tune the text encoder of CLIP with our language-agnostic method so that it distinguishes perturbed text from the original text. To verify the robustness of PR-MCS, we introduce a new fine-grained evaluation dataset consisting of detailed captions, critical objects, and the relationships between the objects for 3,000 images in five languages. In our experiments, PR-MCS significantly outperforms baseline metrics in capturing lexical noise across every perturbation type in all five languages, demonstrating that PR-MCS is highly robust to lexical perturbations.
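To make the notion of perturbation robustness concrete, the following is a minimal sketch, not the PR-MCS implementation. It assumes image and text embeddings have already been produced by some CLIP-style encoder and are passed in as plain, non-zero vectors. The scoring function follows the standard CLIPScore form, w * max(cos, 0) with w = 2.5. The three perturbation helpers (`repeat_token`, `drop_token`, `shuffle_tokens`) and the `detection_rate` measure are hypothetical illustrations; the abstract does not enumerate the perturbation types or the exact robustness measure used.

```python
import math
import random


def clip_score(image_emb, text_emb, w=2.5):
    """CLIPScore-style value: w * max(cos(image, text), 0).

    Assumes both embeddings are non-zero vectors of equal length.
    """
    dot = sum(a * b for a, b in zip(image_emb, text_emb))
    norm = math.sqrt(sum(a * a for a in image_emb)) * math.sqrt(
        sum(b * b for b in text_emb)
    )
    return w * max(dot / norm, 0.0)


# Hypothetical lexical perturbations on a non-empty token list.
def repeat_token(tokens, rng):
    """Duplicate one randomly chosen token in place."""
    i = rng.randrange(len(tokens))
    return tokens[: i + 1] + [tokens[i]] + tokens[i + 1 :]


def drop_token(tokens, rng):
    """Remove one randomly chosen token."""
    i = rng.randrange(len(tokens))
    return tokens[:i] + tokens[i + 1 :]


def shuffle_tokens(tokens, rng):
    """Return a randomly reordered copy of the tokens."""
    shuffled = tokens[:]
    rng.shuffle(shuffled)
    return shuffled


def detection_rate(score_pairs):
    """Fraction of (original, perturbed) score pairs where the original is higher.

    Illustrative robustness measure only; a metric that is robust to a
    perturbation should score the original caption above its perturbed version.
    """
    return sum(orig > pert for orig, pert in score_pairs) / len(score_pairs)


if __name__ == "__main__":
    rng = random.Random(0)
    caption = "a dog chases a red ball".split()
    for perturb in (repeat_token, drop_token, shuffle_tokens):
        print(perturb.__name__, " ".join(perturb(caption, rng)))
```

In use, each original and perturbed caption would be encoded by the text encoder, scored against the image embedding with `clip_score`, and the resulting pairs passed to `detection_rate`; a metric that is insensitive to lexical noise would yield a rate close to chance.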