While recent vision-and-language models (VLMs) like CLIP are powerful tools for analyzing text and images in a shared semantic space, they do not explicitly model the hierarchical nature of the set of texts that may describe an image. Conversely, existing multimodal hierarchical representation learning methods require costly training from scratch, failing to leverage the knowledge encoded by state-of-the-art multimodal foundation models. In this work, we study the knowledge of existing foundation models, finding that they exhibit emergent understanding of visual-semantic hierarchies despite not being directly trained for this purpose. We propose the Radial Embedding (RE) framework for probing and optimizing hierarchical understanding, and contribute the HierarCaps dataset, a benchmark facilitating the study of hierarchical knowledge in image-text representations, constructed automatically via large language models. Our results show that foundation VLMs exhibit zero-shot hierarchical understanding, surpassing the performance of prior models explicitly designed for this purpose. Furthermore, we show that foundation models can be better aligned with hierarchical reasoning through a text-only fine-tuning phase, while retaining pretraining knowledge.