We investigate compositional structures in data embeddings from pre-trained vision-language models (VLMs). Traditionally, compositionality has been associated with algebraic operations on embeddings of words from a pre-existing vocabulary. In contrast, we seek to approximate representations from an encoder as combinations of a smaller set of vectors in the embedding space. These vectors can be seen as "ideal words" for generating concepts directly within the embedding space of the model. We first present a framework for understanding compositional structures from a geometric perspective. We then explain what these compositional structures entail probabilistically in the case of VLM embeddings, providing intuitions for why they arise in practice. Finally, we empirically explore these structures in CLIP's embeddings and we evaluate their usefulness for solving different vision-language tasks such as classification, debiasing, and retrieval. Our results show that simple linear algebraic operations on embedding vectors can be used as compositional and interpretable methods for regulating the behavior of VLMs.
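As a rough illustration of approximating embeddings by combinations of a smaller set of vectors, the sketch below assumes concepts factor into two groups (for example attribute and object). It takes each "ideal word" to be a centered mean of the composite embeddings that share one factor. This averaging construction, the function names `ideal_words` and `compose`, and the synthetic data are assumptions made for the example; nothing here is taken from the text above, and no real CLIP embeddings are used.

```python
import numpy as np

def ideal_words(emb):
    """Estimate additive components from composite embeddings.

    emb has shape (n_attr, n_obj, d); emb[i, j] is the embedding of the
    composite concept (attribute i, object j). Assumed construction:
    each component is a mean over the other factor, centered by the
    global mean.
    """
    u0 = emb.mean(axis=(0, 1))        # global mean, shape (d,)
    u_attr = emb.mean(axis=1) - u0    # one vector per attribute, (n_attr, d)
    u_obj = emb.mean(axis=0) - u0     # one vector per object, (n_obj, d)
    return u0, u_attr, u_obj

def compose(u0, u_attr, u_obj):
    """Rebuild all composite embeddings as u0 + u_attr[i] + u_obj[j]."""
    return u0 + u_attr[:, None, :] + u_obj[None, :, :]

# Synthetic stand-in for encoder outputs: additive structure plus small noise.
rng = np.random.default_rng(0)
n_attr, n_obj, d = 3, 4, 16
true_attr = rng.normal(size=(n_attr, d))
true_obj = rng.normal(size=(n_obj, d))
noise = 0.01 * rng.normal(size=(n_attr, n_obj, d))
emb = true_attr[:, None, :] + true_obj[None, :, :] + noise

u0, u_attr, u_obj = ideal_words(emb)
approx = compose(u0, u_attr, u_obj)

# Relative reconstruction error: small when the embeddings are close to additive.
rel_err = np.linalg.norm(emb - approx) / np.linalg.norm(emb)
print(f"relative error: {rel_err:.4f}")
```

Here 3 + 4 component vectors plus one mean vector stand in for 12 composite embeddings. On real encoder outputs, the same relative error would give one simple measure of how far the embeddings depart from this additive structure.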