CoVLM: Composing Visual Entities and Relationships in Large Language Models Via Communicative Decoding

A remarkable ability of human beings resides in compositional reasoning, i.e., the capacity to make "infinite use of finite means". However, current large vision-language foundation models (VLMs) fall short of such compositional abilities because of their "bag-of-words" behavior and their inability to construct words that correctly represent visual entities and the relations among them. To this end, we propose CoVLM, which guides the LLM to explicitly compose visual entities and relationships within the text and to communicate dynamically with the vision encoder and detection network, achieving vision-language communicative decoding. Specifically, we first devise a set of novel communication tokens for the LLM, which enable dynamic communication between the visual detection system and the language system. The LLM generates a communication token after a visual entity or a relation to instruct the detection network to propose regions relevant to the sentence generated so far. The proposed regions of interest (ROIs) are then fed back into the LLM so that subsequent language generation is conditioned on the relevant regions. The LLM is thus able to compose visual entities and relationships through the communication tokens. Vision-to-language and language-to-vision communication are performed iteratively until the entire sentence is generated. Our framework bridges the gap between visual perception and LLMs and outperforms previous VLMs by a large margin on compositional reasoning benchmarks (e.g., ~20% in HICO-DET mAP, ~14% in Cola top-1 accuracy, and ~3% in ARO top-1 accuracy). We also achieve state-of-the-art performance on traditional vision-language tasks such as referring expression comprehension and visual question answering.
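The iterative decoding loop described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. Every name in it (`COMM_TOKEN`, `EOS`, `communicative_decode`, and the toy stand-ins for the LLM and detection network) is hypothetical, and the set of communication tokens is collapsed into a single token for brevity. The sketch shows only the control flow: the language model emits tokens, a communication token triggers region proposals, and the proposed ROIs are fed back as context for the next token.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)

# Hypothetical token names; the abstract does not specify the actual tokens.
COMM_TOKEN = "<comm>"
EOS = "<eos>"


@dataclass
class DecodeState:
    tokens: List[str] = field(default_factory=list)     # sentence generated so far
    rois: List[List[Box]] = field(default_factory=list)  # one ROI list per communication step


def communicative_decode(
    next_token: Callable[[List[str], List[List[Box]]], str],
    propose_regions: Callable[[List[str]], List[Box]],
    max_steps: int = 32,
) -> DecodeState:
    """Alternate language-to-vision and vision-to-language steps until EOS."""
    state = DecodeState()
    for _ in range(max_steps):
        # Vision-to-language: the next token is conditioned on the ROIs so far.
        tok = next_token(state.tokens, state.rois)
        if tok == EOS:
            break
        state.tokens.append(tok)
        if tok == COMM_TOKEN:
            # Language-to-vision: ask the detector for regions relevant
            # to the sentence generated so far.
            state.rois.append(propose_regions(state.tokens))
    return state


# Toy stand-ins (assumptions for illustration only): a scripted "LLM" and a
# "detector" that returns one made-up box per communication step.
SCRIPT = ["a", "dog", COMM_TOKEN, "chasing", COMM_TOKEN, "a", "ball", COMM_TOKEN, EOS]


def toy_next_token(tokens: List[str], rois: List[List[Box]]) -> str:
    return SCRIPT[len(tokens)]


def toy_propose_regions(tokens: List[str]) -> List[Box]:
    n = tokens.count(COMM_TOKEN)
    return [(10.0 * n, 10.0 * n, 10.0 * n + 5.0, 10.0 * n + 5.0)]


state = communicative_decode(toy_next_token, toy_propose_regions)
```

In this toy run the communication token follows an entity ("dog"), a relation ("chasing"), and a second entity ("ball"), so the detector is queried three times. In a real system the two callables would wrap the LLM and the detection network, and the ROI features would enter the LLM's context rather than being passed as raw boxes.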
