Evaluating and rethinking the current landscape of Large Multimodal Models (LMMs), we observe that widely used visual-language projection approaches (e.g., Q-former or MLP) focus on aligning images with text descriptions yet ignore alignment along the knowledge dimension, i.e., connecting visuals to their relevant knowledge. Visual knowledge plays a significant role in analyzing, inferring, and interpreting information from visuals, and helps improve the accuracy of answers to knowledge-based visual questions. In this paper, we explore improving LMMs through visual-language knowledge alignment, targeting in particular the challenging task of knowledge-based visual question answering (VQA). To this end, we present the Cognitive Visual-Language Mapper (CVLM), which consists of a pretrained Visual Knowledge Aligner (VKA) and a Fine-grained Knowledge Adapter (FKA) used in the multimodal instruction-tuning stage. Specifically, we design the VKA around the interaction between a small language model and a visual encoder, training it on collected image-knowledge pairs to acquire and project visual knowledge. The FKA then distills the fine-grained visual knowledge of an image and injects it into Large Language Models (LLMs). We conduct extensive experiments on knowledge-based VQA benchmarks, and the results show that CVLM significantly improves the performance of LMMs on knowledge-based VQA (an average gain of 5.0%). Ablation studies further verify the effectiveness of both the VKA and the FKA. The code is available at https://github.com/HITsz-TMG/Cognitive-Visual-Language-Mapper
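To make the two-component pipeline concrete, below is a minimal PyTorch sketch of a VKA-style aligner and an FKA-style adapter. The class names (VisualKnowledgeAligner, FineGrainedKnowledgeAdapter), all dimensions, and the cross-attention wiring are illustrative assumptions, not the paper's exact implementation; see the linked repository for the actual code.

```python
# Minimal sketch of the CVLM components, under assumed shapes and wiring.
import torch
import torch.nn as nn


class VisualKnowledgeAligner(nn.Module):
    """VKA sketch: cross-attend visual features into a small LM's hidden space.

    The paper pretrains this on image-knowledge pairs; the exact training
    objective is omitted here.
    """

    def __init__(self, vis_dim=1024, lm_dim=768, n_heads=8):
        super().__init__()
        self.proj = nn.Linear(vis_dim, lm_dim)  # map visual features to LM space
        self.cross_attn = nn.MultiheadAttention(lm_dim, n_heads, batch_first=True)

    def forward(self, vis_feats, lm_hidden):
        v = self.proj(vis_feats)                          # (B, Nv, lm_dim)
        # Small-LM states query the projected visual tokens.
        knowledge, _ = self.cross_attn(lm_hidden, v, v)   # (B, Nt, lm_dim)
        return knowledge


class FineGrainedKnowledgeAdapter(nn.Module):
    """FKA sketch: distill VKA output into compact tokens injected into the LLM."""

    def __init__(self, lm_dim=768, llm_dim=4096, n_tokens=32, n_heads=8):
        super().__init__()
        # Learnable queries that pool fine-grained knowledge (assumed design).
        self.queries = nn.Parameter(torch.randn(n_tokens, lm_dim))
        self.attn = nn.MultiheadAttention(lm_dim, n_heads, batch_first=True)
        self.to_llm = nn.Linear(lm_dim, llm_dim)  # project into LLM embedding space

    def forward(self, knowledge):
        q = self.queries.unsqueeze(0).expand(knowledge.size(0), -1, -1)
        distilled, _ = self.attn(q, knowledge, knowledge)
        return self.to_llm(distilled)                     # (B, n_tokens, llm_dim)


if __name__ == "__main__":
    vis = torch.randn(2, 256, 1024)  # e.g., ViT patch features
    hid = torch.randn(2, 16, 768)    # small-LM hidden states
    k = VisualKnowledgeAligner()(vis, hid)
    tokens = FineGrainedKnowledgeAdapter()(k)
    print(tokens.shape)              # torch.Size([2, 32, 4096])
```

In this reading, the distilled tokens would be concatenated with the usual image and text embeddings in the LLM input sequence during multimodal instruction tuning; the injection point is an assumption of the sketch.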