In recent years, large visual-language (V+L) models have achieved great success in various downstream tasks. However, it is not well studied whether these models have a conceptual grasp of the visual content. In this work, we focus on the conceptual understanding of these large V+L models. To facilitate this study, we propose novel benchmarking datasets for probing three different aspects of content understanding: 1) \textit{relations}, 2) \textit{composition}, and 3) \textit{context}. Our probes are grounded in cognitive science and help determine whether a V+L model can, for example, recognize that snow garnished with a man is implausible, or identify beach furniture by knowing it is located on a beach. We experiment with many recent state-of-the-art V+L models and observe that these models mostly \textit{fail to demonstrate} a conceptual understanding. This study reveals several interesting insights, such as that \textit{cross-attention} helps in learning conceptual understanding, and that CNNs are better at \textit{texture and patterns}, while Transformers are better at \textit{color and shape}. We further utilize some of these insights and investigate a \textit{simple finetuning technique} that rewards the three conceptual understanding measures, with promising initial results. The proposed benchmarks will drive the community to delve deeper into conceptual understanding and foster advancements in the capabilities of large V+L models. The code and dataset are available at: \url{https://tinyurl.com/vlm-robustness}