In this paper, we study whether representations of primitive concepts, such as the colors and shapes of object parts, emerge automatically within pretrained vision-language (VL) models. We propose a two-step framework, Compositional Concept Mapping (CompMap), to investigate this. CompMap first asks a VL model to generate concept activations from text prompts drawn from a predefined list of primitive concepts, and then learns an explicit composition model that maps the primitive concept activations (e.g., the likelihood of a black tail or a red wing) to composite concepts (e.g., a red-winged blackbird). We demonstrate that a composition model can be designed as a set operation, and show that such a model is straightforward for machines to learn, as a linear classifier, from ground-truth primitive concepts. We thus hypothesize that if primitive concepts indeed emerge in a pretrained VL model, its primitive concept activations can be used to learn a composition model similar to the one designed by experts. We propose a quantitative metric to measure this degree of similarity, and refer to it as the interpretability of the primitive concept representations learned by VL models. We also measure the classification accuracy obtained when the primitive concept activations and the learned composition model are used to predict composite concepts, and refer to it as the usefulness metric. Our study reveals that state-of-the-art pretrained VL models learn primitive concepts that are highly useful for fine-grained visual recognition on the CUB dataset and for compositional generalization on the MIT-States dataset. However, our qualitative analyses show that the learned composition models have low interpretability. Our results reveal the limitations of existing VL models and the need for pretraining objectives that encourage the acquisition of primitive concepts.
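To make the two ingredients described above concrete, the following minimal sketch shows one way a composition model could be written as a set operation, together with one possible way to compare learned weights against an expert definition. Everything in it is an illustrative assumption rather than the CompMap implementation: the concept names, the activation values, the use of a mean over set members as the set operation, and the top-k overlap score, which is a stand-in and not the interpretability metric proposed in the paper.

```python
from typing import Dict, Set

# Hypothetical expert-designed composition: each composite concept is
# defined as a set of primitive concepts (names are illustrative only).
EXPERT: Dict[str, Set[str]] = {
    "red-winged blackbird": {"black tail", "red wing"},
    "cardinal": {"red tail", "red wing"},
}


def compose(activations: Dict[str, float], definition: Set[str]) -> float:
    """Score a composite concept as the mean activation of its primitives.

    The mean is an assumed soft version of "all primitives in the set are
    present"; primitives with no activation are treated as 0.0.
    """
    return sum(activations.get(p, 0.0) for p in definition) / len(definition)


def classify(activations: Dict[str, float],
             composites: Dict[str, Set[str]] = EXPERT) -> str:
    """Return the composite concept with the highest composition score."""
    return max(composites, key=lambda name: compose(activations, composites[name]))


def top_k_overlap(weights: Dict[str, float], definition: Set[str]) -> float:
    """Assumed similarity score between a learned linear composition and an
    expert definition: the fraction of expert primitives found among the
    k highest-weighted primitives, with k = len(definition).
    """
    k = len(definition)
    top = sorted(weights, key=weights.get, reverse=True)[:k]
    return len(set(top) & definition) / k


if __name__ == "__main__":
    # Made-up primitive concept activations for a single image.
    acts = {"black tail": 0.9, "red wing": 0.8, "red tail": 0.1}
    print(classify(acts))

    # Made-up learned weights for the "red-winged blackbird" classifier row.
    learned = {"black tail": 1.2, "red wing": 0.1, "red tail": 0.7}
    print(top_k_overlap(learned, EXPERT["red-winged blackbird"]))
```

In this toy setting, a classifier can predict the correct composite concept (usefulness) while its highest weights only partly coincide with the expert-defined primitives (interpretability), which is the kind of gap the two metrics are meant to separate.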