Joint vision-language models have shown strong performance across a diverse set of tasks. However, little is known about their limitations, because the high-dimensional space these models learn makes semantic errors difficult to identify. Recent work has addressed this problem by designing highly controlled probing-task benchmarks. Our paper introduces a more scalable solution that relies on already annotated benchmarks. Our method extracts a large set of diverse features from a vision-language benchmark and measures their correlation with the output of the target model. We confirm previous findings that CLIP behaves like a bag-of-words model and performs better with nouns and verbs; we also uncover novel insights, such as CLIP getting confused by concrete words. Our framework is available at https://github.com/MichiganNLP/Scalable-VLM-Probing and can be used with other multimodal models and benchmarks.
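The feature-correlation idea described above can be illustrated with a minimal sketch. It is not the released framework's code: the function names (`extract_features`, `correlate_features`), the two toy caption features, the choice of Pearson correlation as the measure, and the example scores are all assumptions made for illustration. In practice the scores would come from the target model (for example, an image-text similarity score per benchmark instance).

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        # Correlation is undefined for a constant sequence; report 0 by convention.
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def extract_features(caption):
    """Hypothetical toy features; a real probe would extract many more."""
    words = caption.lower().split()
    return {
        "num_words": float(len(words)),
        "has_negation": float(any(w in {"not", "no", "never"} for w in words)),
    }


def correlate_features(captions, scores):
    """Correlate each caption feature with the model's per-instance score."""
    rows = [extract_features(c) for c in captions]
    return {
        name: pearson([row[name] for row in rows], scores)
        for name in rows[0]
    }


# Made-up captions and scores, standing in for benchmark instances
# and the target model's outputs.
captions = [
    "a dog runs",
    "a dog does not run",
    "a cat sleeps",
    "no cat is sleeping here",
]
scores = [0.9, 0.2, 0.8, 0.3]

correlations = correlate_features(captions, scores)
for name, r in sorted(correlations.items()):
    print(f"{name}: {r:+.3f}")
```

A strongly negative coefficient for a feature would suggest that the model scores instances with that feature lower. Because correlated features can confound each other, a sketch like this only flags candidates for closer inspection.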