We investigate compositional structures in vector data embeddings from pre-trained vision-language models (VLMs). Traditionally, compositionality has been associated with algebraic operations on embeddings of words from a pre-existing vocabulary. In contrast, we seek to approximate label representations from a text encoder as combinations of a smaller set of vectors in the embedding space. These vectors can be seen as "ideal words" which can be used to generate new concepts in an efficient way. We present a theoretical framework for understanding linear compositionality, drawing connections with mathematical representation theory and previous definitions of disentanglement. We give theoretical and empirical evidence that ideal words yield good compositional approximations of composite concepts and can be more effective than token-based decompositions of the same concepts.
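To make the idea of "ideal words" concrete, the following is a minimal sketch of one way such vectors could be computed for labels made of two factors (an attribute and an object). The averaging construction, the function names (`ideal_words`, `compose`), and the toy three-dimensional vectors are all assumptions made for illustration; the abstract does not specify the procedure, and real label embeddings would come from a VLM text encoder rather than from hand-written lists.

```python
from itertools import product


def mean(vectors):
    """Component-wise mean of a non-empty collection of equal-length vectors."""
    vectors = list(vectors)
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def add(*vectors):
    """Component-wise sum of one or more equal-length vectors."""
    return [sum(col) for col in zip(*vectors)]


def sub(a, b):
    """Component-wise difference a - b."""
    return [x - y for x, y in zip(a, b)]


def ideal_words(embeddings):
    """Fit one offset vector per factor value (hypothetical construction).

    embeddings maps label tuples (z_1, ..., z_k) to vectors.
    Returns (u0, words): u0 is the mean of all embeddings, and
    words[i][v] is the mean of embeddings whose i-th factor equals v, minus u0.
    """
    u0 = mean(embeddings.values())
    k = len(next(iter(embeddings)))
    words = []
    for i in range(k):
        values = {label[i] for label in embeddings}
        words.append({
            v: sub(mean(vec for label, vec in embeddings.items() if label[i] == v), u0)
            for v in values
        })
    return u0, words


def compose(u0, words, label):
    """Approximate the embedding of a composite label as u0 plus one ideal word per factor."""
    return add(u0, *(words[i][v] for i, v in enumerate(label)))


# Toy stand-ins for encoder outputs (assumed, exactly additive by construction).
attrs = {"red": [1.0, 0.0, 0.0], "blue": [0.0, 1.0, 0.0], "green": [1.0, 1.0, 0.0]}
objs = {"car": [0.0, 0.0, 1.0], "dog": [0.0, 0.0, -1.0], "cup": [0.5, 0.0, 0.5]}
embeddings = {(a, o): add(attrs[a], objs[o]) for a, o in product(attrs, objs)}

u0, words = ideal_words(embeddings)

# Largest absolute coordinate error of the compositional approximation, per label.
errors = {
    label: max(abs(x - y) for x, y in zip(compose(u0, words, label), vec))
    for label, vec in embeddings.items()
}
```

In this toy setup, nine composite labels are described by six factor vectors plus a shared mean, which is the sense in which the set of ideal words is smaller than the set of labels. Because the toy embeddings are additive by construction, the approximation error is zero up to floating-point rounding; with real encoder outputs the reconstruction would only be approximate.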