Emergent Visual-Semantic Hierarchies in Image-Text Representations

While recent vision-and-language models (VLMs) like CLIP are powerful tools for analyzing text and images in a shared semantic space, they do not explicitly model the hierarchical nature of the set of texts that may describe an image. Conversely, existing multimodal hierarchical representation learning methods require costly training from scratch, failing to leverage the knowledge encoded by state-of-the-art multimodal foundation models. In this work, we study the knowledge of existing foundation models, finding that they exhibit emergent understanding of visual-semantic hierarchies despite not being directly trained for this purpose. We propose the Radial Embedding (RE) framework for probing and optimizing hierarchical understanding, and contribute the HierarCaps dataset, a benchmark facilitating the study of hierarchical knowledge in image-text representations, constructed automatically via large language models. Our results show that foundation VLMs exhibit zero-shot hierarchical understanding, surpassing the performance of prior models explicitly designed for this purpose. Furthermore, we show that foundation models may be better aligned to hierarchical reasoning via a text-only fine-tuning phase, while retaining pretraining knowledge.

