Do vision-language models (VLMs) pre-trained to caption an image of a "durian" learn visual concepts such as "brown" (color) and "spiky" (texture) at the same time? We aim to answer this question because visual concepts learned "for free" would enable a wide range of applications, such as neuro-symbolic reasoning and human-interpretable object classification. We assume that the visual concepts, if captured by pre-trained VLMs, can be extracted through their vision-language interface with text-based concept prompts. We observe that recent works prompting VLMs with concepts often differ in their strategies for defining and evaluating the visual concepts, leading to conflicting conclusions. We propose a new concept definition strategy based on two observations: first, certain concept prompts include shortcuts that recognize correct concepts for the wrong reasons; second, multimodal information (e.g., visual discriminativeness and textual knowledge) should be leveraged when selecting the concepts. Our proposed concept discovery and learning (CDL) framework is thus designed to identify a diverse list of generic visual concepts (e.g., "spiky" as opposed to "spiky durian"), which are ranked and selected based on visual and language mutual information. We carefully design quantitative and human evaluations of the discovered concepts on six diverse visual recognition datasets, which confirm that pre-trained VLMs do learn visual concepts that provide accurate and thorough descriptions of the recognized objects. All code and models are publicly released.
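To make the extraction idea concrete, below is a minimal sketch (not the authors' released code) of probing a pre-trained VLM with generic, class-agnostic concept prompts and ranking concepts by a simple visual-discriminativeness proxy. It assumes CLIP via the Hugging Face `transformers` library; the concept list, image filenames, and the variance-based score are illustrative placeholders, not the paper's actual mutual-information criterion.

```python
# Sketch: score generic concept prompts ("spiky", not "spiky durian")
# against images with a pre-trained CLIP model, then rank concepts.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Generic, class-agnostic concept prompts.
concepts = ["spiky", "brown", "smooth", "round"]
# Hypothetical image files for illustration only.
images = [Image.open(p) for p in ["durian.jpg", "orange.jpg"]]

inputs = processor(text=concepts, images=images,
                   return_tensors="pt", padding=True)
with torch.no_grad():
    out = model(**inputs)

# logits_per_image: (num_images, num_concepts) image-concept similarities.
probs = out.logits_per_image.softmax(dim=-1)

# Illustrative stand-in for a discriminativeness criterion: concepts whose
# activations vary most across images are more useful for telling objects apart.
scores = probs.var(dim=0)
for concept, s in sorted(zip(concepts, scores.tolist()), key=lambda t: -t[1]):
    print(f"{concept}: {s:.4f}")
```

In this sketch the ranking signal comes only from the vision side; the CDL framework described above additionally incorporates textual knowledge when selecting concepts.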