Vision-language models (VLMs) serve as foundation models for multi-modal applications such as image captioning and text-to-image generation. Recent studies have highlighted limitations in VLM text encoders, particularly in compositionality and semantic understanding, though the underlying reasons for these limitations remain unclear. In this work, we aim to address this gap by analyzing the syntactic information, one of the fundamental linguistic properties, encoded by the text encoders of VLMs. We perform a thorough analysis comparing VLMs with different objective functions, parameter sizes, and training data sizes, as well as comparing them with uni-modal language models (ULMs), in their ability to encode syntactic knowledge. Our findings suggest that ULM text encoders acquire syntactic information more effectively than those of VLMs. The syntactic information learned by VLM text encoders is shaped primarily by the pre-training objective, which plays a more crucial role than other factors such as model architecture, model size, or the volume of pre-training data. Models also exhibit different layer-wise trends: CLIP's performance drops across layers, whereas for the other models the middle layers are richest in syntactic knowledge.
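To make the layer-wise analysis concrete, below is a minimal sketch of how per-layer representations of a CLIP text encoder could be extracted for syntactic probing. It is illustrative only: the checkpoint name (`openai/clip-vit-base-patch32`), the example sentence, and the mean-pooling over tokens are assumptions for the example, not details taken from the paper, and the downstream probe (e.g., a linear classifier over syntactic labels) is omitted.

```python
import torch
from transformers import CLIPTokenizer, CLIPTextModel

# Hypothetical checkpoint used for illustration; the paper's exact models may differ.
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-base-patch32")

sentence = "The cat that the dog chased ran away."
inputs = tokenizer(sentence, return_tensors="pt")

with torch.no_grad():
    outputs = text_encoder(**inputs, output_hidden_states=True)

# hidden_states is a tuple: the embedding output plus one tensor per transformer layer.
for layer_idx, layer_states in enumerate(outputs.hidden_states):
    # Mean-pool token representations to obtain one vector per layer; such vectors
    # would serve as inputs to a layer-wise syntactic probe (probe not shown).
    pooled = layer_states.mean(dim=1)
    print(f"layer {layer_idx}: pooled representation shape {tuple(pooled.shape)}")
```

Comparing probe accuracy across these layers, and across VLM and ULM encoders, is one straightforward way to surface the layer-wise trends described above.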