Do vision-language models (VLMs) pre-trained to caption an image of a "durian" learn visual concepts such as "brown" (color) and "spiky" (texture) at the same time? We aim to answer this question, as visual concepts learned "for free" would enable a wide range of applications such as neuro-symbolic reasoning and human-interpretable object classification. We assume that the visual concepts, if captured by pre-trained VLMs, can be extracted through their vision-language interface with text-based concept prompts. We observe that recent works prompting VLMs with concepts often differ in their strategies for defining and evaluating the visual concepts, leading to conflicting conclusions. We propose a new concept definition strategy based on two observations: first, certain concept prompts include shortcuts that recognize correct concepts for the wrong reasons; second, multimodal information (e.g., visual discriminativeness and textual knowledge) should be leveraged when selecting the concepts. Our proposed concept discovery and learning (CDL) framework is thus designed to identify a diverse list of generic visual concepts (e.g., "spiky" as opposed to "spiky durian"), which are ranked and selected based on visual and language mutual information. We carefully design quantitative and human evaluations of the discovered concepts on six diverse visual recognition datasets, which confirm that pre-trained VLMs do learn visual concepts that provide accurate and thorough descriptions of the recognized objects. All code and models are publicly released.
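To make the vision-language interface concrete, the following is a minimal sketch of how generic concept prompts could score images with a CLIP-style VLM and how a mutual-information criterion might rank the resulting concepts. The checkpoint name, the toy concept list, the prompt template, and sklearn's `mutual_info_classif` estimator are illustrative assumptions, not the paper's exact CDL implementation.

```python
# Sketch: scoring generic concept prompts with a CLIP-style VLM and ranking
# concepts by how informative their activations are about class labels.
# Assumptions (not from the paper): the HF "openai/clip-vit-base-patch32"
# checkpoint, a toy concept list, and sklearn's mutual_info_classif as a
# stand-in for CDL's visual-language mutual-information criterion.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor
from sklearn.feature_selection import mutual_info_classif

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Generic concepts ("spiky"), not class-entangled ones ("spiky durian").
concepts = ["spiky", "brown", "smooth", "yellow", "round"]
prompts = [f"a photo of a {c} object" for c in concepts]

def concept_scores(images: list[Image.Image]) -> torch.Tensor:
    """Return an (n_images, n_concepts) matrix of image-concept similarities."""
    inputs = processor(text=prompts, images=images,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    # Cosine similarities scaled by CLIP's learned logit scale.
    return out.logits_per_image

def rank_concepts(images: list[Image.Image], labels: list[int]):
    """Rank concepts by estimated mutual information with class labels."""
    scores = concept_scores(images).numpy()
    mi = mutual_info_classif(scores, labels)  # one MI estimate per concept
    order = mi.argsort()[::-1]
    return [(concepts[i], float(mi[i])) for i in order]
```

Prompting with class-agnostic templates ("a photo of a {c} object") rather than class-entangled ones ("spiky durian") mirrors the shortcut concern above: embedding the class name in the prompt would let the model score the concept for the wrong reason.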