We investigate compositional structures in data embeddings from pre-trained vision-language models (VLMs). Traditionally, compositionality has been associated with algebraic operations on embeddings of words from a pre-existing vocabulary. In contrast, we seek to approximate representations from an encoder as combinations of a smaller set of vectors in the embedding space. These vectors can be seen as "ideal words" for generating concepts directly within the embedding space of the model. We first present a framework for understanding compositional structures from a geometric perspective. We then explain what these compositional structures entail probabilistically in the case of VLM embeddings, providing intuitions for why they arise in practice. Finally, we empirically explore these structures in CLIP's embeddings and evaluate their usefulness for solving vision-language tasks such as classification, debiasing, and retrieval. Our results show that simple linear algebraic operations on embedding vectors can be used as compositional and interpretable methods for regulating the behavior of VLMs.
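As an illustration of approximating embeddings by combinations of a smaller set of vectors, the following minimal sketch assumes an additive form: each embedding of an (attribute, object) pair is approximated by a global mean plus one vector per attribute and one per object. The additive form, the function names `ideal_word_decomposition` and `compose`, and the synthetic data are all assumptions made for illustration; they are not taken from the abstract, and no real CLIP embeddings are used.

```python
import numpy as np

def ideal_word_decomposition(emb):
    """Fit mean + attribute vector + object vector to a grid of embeddings.

    emb has shape (n_attr, n_obj, dim). For a complete grid, averaging over
    rows and columns gives the least-squares additive fit.
    """
    mean = emb.mean(axis=(0, 1))            # global mean, shape (dim,)
    attr_words = emb.mean(axis=1) - mean    # one vector per attribute
    obj_words = emb.mean(axis=0) - mean     # one vector per object
    return mean, attr_words, obj_words

def compose(mean, attr_words, obj_words):
    """Rebuild every (attribute, object) embedding from the fitted vectors."""
    return mean + attr_words[:, None, :] + obj_words[None, :, :]

# Synthetic stand-in for encoder outputs (assumption: nearly additive data).
rng = np.random.default_rng(0)
n_attr, n_obj, dim = 3, 4, 16
true_attr = rng.normal(size=(n_attr, dim))
true_obj = rng.normal(size=(n_obj, dim))
noise = 0.01 * rng.normal(size=(n_attr, n_obj, dim))
emb = true_attr[:, None, :] + true_obj[None, :, :] + noise

mean, attr_words, obj_words = ideal_word_decomposition(emb)
approx = compose(mean, attr_words, obj_words)

# Relative reconstruction error: small when the data is close to additive.
rel_error = np.linalg.norm(emb - approx) / np.linalg.norm(emb)
print(f"relative reconstruction error: {rel_error:.4f}")
```

Here 12 composite embeddings are described by 3 + 4 vectors plus a mean. On real encoder outputs the reconstruction error would indicate how well such an additive structure holds.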