Joint vision-language models have achieved strong performance across a diverse set of tasks. However, little is known about their limitations, as the high-dimensional space learned by these models makes it difficult to identify semantic errors. Recent work has addressed this problem by designing highly controlled probing task benchmarks. Our paper introduces a more scalable solution that relies on already annotated benchmarks. Our method extracts a large set of diverse features from a vision-language benchmark and measures their correlation with the output of the target model. We confirm previous findings that CLIP behaves like a bag-of-words model and performs better with nouns and verbs; we also uncover novel insights, such as CLIP getting confused by concrete words. Our framework is available at https://github.com/MichiganNLP/Scalable-VLM-Probing and can be used with other multimodal models and benchmarks.
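The feature-correlation idea can be illustrated with a minimal sketch. This is not the released framework's code: the function names (`pearson`, `extract_features`, `correlate_features`), the three surface-level caption features, and the toy captions and scores are all hypothetical placeholders chosen for illustration. A real analysis would use the benchmark's own annotations, a much richer feature set, and actual model scores.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        # A constant feature carries no signal; report zero correlation.
        return 0.0
    return cov / sqrt(var_x * var_y)


def extract_features(caption):
    """Hypothetical surface features; stand-ins for a real feature set."""
    tokens = caption.lower().split()
    n_tokens = len(tokens)
    return {
        "num_tokens": float(n_tokens),
        "mean_token_length": (
            sum(len(t) for t in tokens) / n_tokens if n_tokens else 0.0
        ),
        "has_negation": float(any(t in {"not", "no", "never"} for t in tokens)),
    }


def correlate_features(captions, scores):
    """Correlate each extracted feature with the model's per-example score."""
    rows = [extract_features(c) for c in captions]
    return {
        name: pearson([row[name] for row in rows], scores)
        for name in rows[0]
    }


if __name__ == "__main__":
    # Toy captions and made-up scores, for illustration only.
    captions = [
        "a dog runs",
        "the cat is not sleeping on the sofa",
        "two people never talk",
        "a red car",
    ]
    scores = [0.9, 0.4, 0.5, 0.8]
    for name, r in correlate_features(captions, scores).items():
        print(f"{name}: {r:+.2f}")
```

Features whose correlation with the score is strongly negative would point to properties of the input that the model handles poorly, while strongly positive ones would point to properties it handles well. Plain Pearson correlation is used here only for simplicity; it is an assumption of this sketch rather than a statement about the statistic the framework computes.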