As a key component of multimodal large language models (MLLMs), the vision encoder largely determines how well an MLLM understands diverse image content. Although large-scale pretrained vision encoders such as those in CLIP and DINOv2 deliver promising performance, we find that no single vision encoder dominates across all kinds of image content; for example, the CLIP vision encoder yields outstanding results on general image understanding but performs poorly on document or chart content. To alleviate the bias of the CLIP vision encoder, we first delve into the inherent behavior of different pre-trained vision encoders and then propose MoVA, a powerful and novel MLLM that adaptively routes and fuses task-specific vision experts with a coarse-to-fine mechanism. In the coarse-grained stage, we design a context-aware expert routing strategy that dynamically selects the most suitable vision experts according to the user instruction, the input image, and the expertise of each vision expert. This strategy benefits from the strong model-function understanding of a large language model (LLM) equipped with expert-routing low-rank adaptation (LoRA). In the fine-grained stage, we devise the mixture-of-vision-expert adapter (MoV-Adapter) to extract and fuse task-specific knowledge from the various experts. This coarse-to-fine paradigm effectively leverages expert representations based on multimodal context and model expertise, further enhancing generalization. We conduct extensive experiments to evaluate the effectiveness of the proposed approach. Without bells and whistles, MoVA achieves significant gains over current state-of-the-art methods on a wide range of challenging multimodal benchmarks. Code and models will be available at https://github.com/TempleX98/MoVA.
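The coarse-to-fine idea described above can be illustrated with a minimal, hypothetical sketch: a coarse routing step picks the top-k experts for a query, and a fine fusion step mixes the selected experts' features with soft weights. All names, scores, and feature vectors below are illustrative stand-ins, not the paper's implementation; in MoVA, routing scores come from an LLM with expert-routing LoRA and fusion is performed by the learned MoV-Adapter.

```python
# Hypothetical sketch of coarse-to-fine expert routing and fusion (illustrative only).
import math

# Toy 4-d "features" standing in for each vision expert's image encoding.
EXPERT_FEATURES = {
    "clip": [0.9, 0.1, 0.2, 0.0],
    "dino": [0.2, 0.8, 0.1, 0.1],
    "ocr":  [0.0, 0.1, 0.9, 0.3],
}

# Toy relevance scores a router might assign per instruction type; in MoVA these
# come from an LLM equipped with expert-routing LoRA, not a lookup table.
ROUTER_SCORES = {
    "general":  {"clip": 2.0, "dino": 1.0, "ocr": -1.0},
    "document": {"clip": 0.0, "dino": -0.5, "ocr": 2.5},
}

def route(task: str, top_k: int = 2) -> list[str]:
    """Coarse stage: select the top-k experts for this task."""
    scores = ROUTER_SCORES[task]
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

def fuse(experts: list[str], task: str) -> list[float]:
    """Fine stage: softmax-weighted mix of the selected experts' features."""
    exps = [math.exp(ROUTER_SCORES[task][e]) for e in experts]
    weights = [x / sum(exps) for x in exps]
    dim = len(EXPERT_FEATURES[experts[0]])
    return [
        sum(w * EXPERT_FEATURES[e][i] for w, e in zip(weights, experts))
        for i in range(dim)
    ]

selected = route("document")
print(selected)  # → ['ocr', 'clip']: the OCR-style expert ranks first for document queries
fused = fuse(selected, "document")
```

The two stages mirror the abstract's description: routing is discrete (which experts to consult), while fusion is continuous (how much weight each consulted expert's representation receives).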