LION: Empowering Multimodal Large Language Model with Dual-Level Visual Knowledge

Multimodal Large Language Models (MLLMs) have endowed LLMs with the ability to perceive and understand multi-modal signals. However, most existing MLLMs adopt vision encoders pretrained on coarsely aligned image-text pairs, which leads to insufficient extraction of, and reasoning over, visual knowledge. To address this issue, we devise a dual-Level vIsual knOwledge eNhanced Multimodal Large Language Model (LION), which empowers the MLLM by injecting visual knowledge at two levels. 1) Progressive incorporation of fine-grained spatial-aware visual knowledge. We design a vision aggregator that works together with region-level vision-language (VL) tasks to incorporate fine-grained spatial-aware visual knowledge into the MLLM. To alleviate the conflict between image-level and region-level VL tasks during incorporation, we devise a dedicated stage-wise instruction-tuning strategy using a mixture-of-adapters. This progressive incorporation scheme lets the two kinds of VL tasks promote each other. 2) Soft prompting of high-level semantic visual evidence. We provide the MLLM with high-level semantic visual evidence by leveraging diverse image tags. To mitigate the potential influence of imperfectly predicted tags, we propose a soft prompting method that embeds a learnable token into the tailored text instruction. Comprehensive experiments on several multi-modal benchmarks demonstrate the superiority of our model (e.g., improvements of 5% accuracy on VSR and 3% CIDEr on TextCaps over InstructBLIP, and 5% accuracy on RefCOCOg over Kosmos-2).
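
The soft prompting idea in point 2 can be illustrated with a minimal, standard-library-only sketch. It is not the authors' implementation: the placeholder name `<soft>`, the instruction wording, the `ToyEmbedder` class, and the embedding size are all assumptions made for illustration. The sketch only shows the mechanism the abstract describes: predicted tags are written into a text instruction, and one placeholder position is filled with a learnable vector while the ordinary word embeddings stay frozen.

```python
import random

SOFT_TOKEN = "<soft>"  # hypothetical placeholder name, not taken from the paper


def build_instruction(tags):
    """Write predicted image tags into a text instruction (wording is illustrative)."""
    return f"According to {SOFT_TOKEN} you may use or partially use these tags: {', '.join(tags)}"


class ToyEmbedder:
    """Toy stand-in for an LLM embedding layer with one learnable soft-prompt vector."""

    def __init__(self, dim, seed=0):
        self.dim = dim
        self.rng = random.Random(seed)
        self.table = {}                 # frozen word embeddings, created lazily
        self.soft_vector = [0.0] * dim  # the only trainable parameter in this sketch

    def embed_word(self, word):
        if word not in self.table:
            self.table[word] = [self.rng.uniform(-1.0, 1.0) for _ in range(self.dim)]
        return self.table[word]

    def embed(self, text):
        """Embed whitespace-separated tokens, using the soft vector at the placeholder."""
        return [
            self.soft_vector if word == SOFT_TOKEN else self.embed_word(word)
            for word in text.split()
        ]

    def update_soft(self, grad, lr=0.1):
        """One gradient-descent step applied to the soft vector only."""
        for i in range(self.dim):
            self.soft_vector[i] -= lr * grad[i]


embedder = ToyEmbedder(dim=4)
prompt = build_instruction(["dog", "frisbee", "grass"])
vectors = embedder.embed(prompt)
soft_positions = [i for i, w in enumerate(prompt.split()) if w == SOFT_TOKEN]

# A made-up gradient; in a real model it would come from backpropagation.
embedder.update_soft([1.0, -1.0, 0.0, 2.0])
```

Because only `soft_vector` is updated, the model can learn how much weight to give the tag list as a whole, which is one plausible reading of how a learnable token could dampen the effect of imperfectly predicted tags.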
