Recently, large-scale vision-language pre-training models and visual semantic embedding methods have significantly improved image-text matching (ITM) accuracy on the MS COCO 5K test set. However, it is unclear how robust these state-of-the-art (SOTA) models are when used in the wild. In this paper, we propose a novel evaluation benchmark to stress-test the robustness of ITM models. To this end, we add various fooling images and captions to the retrieval pool. Specifically, we alter images by inserting unrelated images, and alter captions by substituting a noun, which can change the meaning of the sentence. We find that simply adding these newly created images and captions to the test set degrades the performance (i.e., Recall@1) of a wide range of SOTA models (e.g., 81.9% $\rightarrow$ 64.5% for BLIP, 66.1% $\rightarrow$ 37.5% for VSE$\infty$). We expect that our findings can provide insights for improving the robustness of vision-language models and for devising more diverse stress-test methods in cross-modal retrieval tasks. Source code and dataset will be available at https://github.com/pseulki/rococo.
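To make the evaluation protocol concrete, the following is a minimal sketch of how Recall@1 can drop when fooling captions are appended to a retrieval pool. It is not the released benchmark code: the function name `recall_at_1`, the toy similarity scores, and the pool sizes are all assumptions chosen for illustration, and a real evaluation would compute similarities with an ITM model.

```python
import numpy as np


def recall_at_1(sim, gt):
    """Fraction of queries whose top-ranked candidate is a correct match.

    sim: (n_queries, n_candidates) array of similarity scores.
    gt:  list of sets; gt[i] holds the correct candidate indices for query i.
    """
    top1 = sim.argmax(axis=1)
    hits = [int(top1[i] in gt[i]) for i in range(len(gt))]
    return float(np.mean(hits))


# Toy image-to-text setup (hypothetical scores): 3 image queries and
# 3 original captions, where caption i is the correct match for image i.
sim_original = np.array([
    [0.9, 0.2, 0.1],
    [0.3, 0.8, 0.2],
    [0.1, 0.4, 0.7],
])
gt = [{0}, {1}, {2}]

# Hypothetical scores for 2 fooling captions (e.g. noun-substituted ones).
# They are never correct matches, so gt is unchanged.
sim_fooling = np.array([
    [0.95, 0.1],  # this fooling caption outranks the true caption of image 0
    [0.2, 0.5],
    [0.1, 0.3],
])

# Appending the fooling captions as extra columns enlarges the retrieval pool.
sim_extended = np.concatenate([sim_original, sim_fooling], axis=1)

r1_before = recall_at_1(sim_original, gt)
r1_after = recall_at_1(sim_extended, gt)
print(f"Recall@1 before: {r1_before:.3f}, after: {r1_after:.3f}")
```

Because the ground-truth matches stay fixed while the pool grows, any drop in Recall@1 in this sketch comes only from fooling candidates being ranked above the true match.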