Do Vision-Language Pretrained Models Learn Composable Primitive Concepts?

In this paper, we study whether representations of primitive concepts--such as colors and shapes of object parts--emerge automatically within pretrained vision-language (VL) models. We propose a two-step framework, Compositional Concept Mapping (CompMap), to investigate this. CompMap first asks a VL model to generate concept activations with text prompts from a predefined list of primitive concepts, and then learns an explicit composition model that maps the primitive concept activations (e.g., the likelihood of black tail or red wing) to composite concepts (e.g., a red-winged blackbird). We demonstrate that a composition model can be designed by experts as a set operation, and show that a composition model is straightforward for machines to learn, as a linear classifier, from ground truth primitive concepts. We thus hypothesize that if primitive concepts indeed emerge in a VL pretrained model, its primitive concept activations can be used to learn a composition model similar to the one designed by experts. We propose a quantitative metric to measure the degree of similarity, and refer to the metric as the interpretability of the learned primitive concept representations of VL models. We also measure the classification accuracy when using the primitive concept activations and the learned composition model to predict the composite concepts, and refer to it as the usefulness metric. Our study reveals that state-of-the-art VL pretrained models learn primitive concepts that are highly useful for fine-grained visual recognition on the CUB dataset, and for compositional generalization tasks on the MIT-States dataset. However, we observe that the learned composition models have low interpretability in our qualitative analyses. Our results reveal the limitations of existing VL models, and the necessity of pretraining objectives that encourage the acquisition of primitive concepts.
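The two-step idea in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the activations are synthetic (noisy copies of an expert-defined primitive table, standing in for what a VL model would score from text prompts), the composition model is a plain softmax regression, and the function names `train_linear_composition`, `usefulness`, and `topk_overlap` are hypothetical. In particular, the abstract does not define the interpretability metric, so the top-k overlap used here is only a stand-in for "similarity to the expert-designed composition model".

```python
import numpy as np

def train_linear_composition(acts, labels, n_classes, lr=0.5, epochs=300):
    """Fit a softmax-regression composition model by gradient descent.

    acts:   (n_samples, n_primitives) primitive concept activations
    labels: (n_samples,) integer composite-concept labels
    """
    n, d = acts.shape
    W = np.zeros((d, n_classes))
    b = np.zeros(n_classes)
    onehot = np.eye(n_classes)[labels]
    for _ in range(epochs):
        logits = acts @ W + b
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        probs = np.exp(logits)
        probs /= probs.sum(axis=1, keepdims=True)
        grad = (probs - onehot) / n                  # d(cross-entropy)/d(logits)
        W -= lr * acts.T @ grad
        b -= lr * grad.sum(axis=0)
    return W, b

def usefulness(acts, labels, W, b):
    """Classification accuracy of the learned composition model."""
    return float(np.mean(np.argmax(acts @ W + b, axis=1) == labels))

def topk_overlap(W, expert, k):
    """Illustrative interpretability proxy (an assumption, not the paper's metric).

    For each composite class, take the k primitives with the largest learned
    weights and report the fraction that the expert table also lists,
    averaged over classes.
    """
    scores = []
    for c in range(expert.shape[0]):
        top = np.argsort(-W[:, c])[:k]
        scores.append(expert[c, top].mean())
    return float(np.mean(scores))

# Hypothetical expert table: 3 composite classes, 6 primitives, 2 per class.
expert = np.array([
    [1, 1, 0, 0, 0, 0],
    [0, 0, 1, 1, 0, 0],
    [0, 0, 0, 0, 1, 1],
])

# Synthetic activations: the expert row for each sample's class plus noise.
rng = np.random.default_rng(0)
labels = rng.integers(0, 3, size=300)
acts = expert[labels] + 0.1 * rng.normal(size=(300, 6))

W, b = train_linear_composition(acts, labels, n_classes=3)
acc = usefulness(acts, labels, W, b)
interp = topk_overlap(W, expert, k=2)
print(f"usefulness (accuracy): {acc:.3f}")
print(f"interpretability proxy (top-2 overlap): {interp:.3f}")
```

In this toy setting the activations are nearly ground truth, so both scores should be high. The abstract's finding corresponds to the case where real VL activations yield high accuracy while the learned weights diverge from the expert table.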

