As a key component of multimodal large language models (MLLMs), the capability of the vision encoder largely determines how well an MLLM understands diverse image content. Although large-scale pretrained vision encoders such as those in CLIP and DINOv2 have delivered promising performance, we find that no single vision encoder dominates across all kinds of image content: for example, the CLIP vision encoder yields outstanding results on general image understanding but performs poorly on document and chart content. To alleviate the bias of the CLIP vision encoder, we first delve into the inherent behavior of different pretrained vision encoders and then propose MoVA, a powerful and novel MLLM that adaptively routes and fuses task-specific vision experts with a coarse-to-fine mechanism. In the coarse-grained stage, we design a context-aware expert routing strategy that dynamically selects the most suitable vision experts according to the user instruction, the input image, and the expertise of each vision expert; this stage benefits from the strong ability of the large language model (LLM) to reason about model functionality. In the fine-grained stage, we design the mixture-of-vision-expert adapter (MoV-Adapter) to extract and fuse task-specific knowledge from the selected experts. This coarse-to-fine paradigm effectively leverages expert representations based on the multimodal context and model expertise, further enhancing generalization. We conduct extensive experiments to evaluate the effectiveness of the proposed approach. Without bells and whistles, MoVA achieves significant performance gains over current state-of-the-art methods on a wide range of challenging multimodal benchmarks.
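To make the coarse-to-fine pipeline concrete, the following is a minimal PyTorch sketch, not the authors' implementation: the names MoVAdapterSketch and route_experts, the gating layout, and the residual fusion are illustrative assumptions, and the real routing decision in MoVA is produced by the LLM from the instruction, image, and expert descriptions.

```python
import torch
import torch.nn as nn

class MoVAdapterSketch(nn.Module):
    """Hypothetical stand-in for the fine-grained MoV-Adapter: computes
    soft, context-conditioned weights over expert features and fuses
    them back into the base visual token stream."""
    def __init__(self, dim: int, num_experts: int):
        super().__init__()
        self.gate = nn.Linear(dim, num_experts)  # per-token expert weights
        self.proj = nn.Linear(dim, dim)          # project fused knowledge

    def forward(self, base_feats, expert_feats):
        # base_feats: (B, N, D) tokens from the base encoder (e.g. CLIP)
        # expert_feats: list of num_experts tensors, each (B, N, D)
        stacked = torch.stack(expert_feats, dim=2)             # (B, N, E, D)
        weights = self.gate(base_feats).softmax(dim=-1)        # (B, N, E)
        fused = (weights.unsqueeze(-1) * stacked).sum(dim=2)   # (B, N, D)
        return base_feats + self.proj(fused)                   # residual fusion

def route_experts(selected_names, expert_pool):
    """Coarse-grained stage stand-in: in MoVA the LLM reads the user
    instruction, the image, and expert descriptions, then names the
    experts to activate; here the decision is passed in directly."""
    return [expert_pool[name] for name in selected_names]

# Toy usage: two routed experts fused into CLIP-like base features.
B, N, D = 2, 16, 64
pool = {"doc": torch.randn(B, N, D), "seg": torch.randn(B, N, D)}
routed = route_experts(["doc", "seg"], pool)
adapter = MoVAdapterSketch(dim=D, num_experts=len(routed))
out = adapter(torch.randn(B, N, D), routed)
print(out.shape)  # torch.Size([2, 16, 64])
```

The design point the sketch illustrates is the separation of concerns: a discrete, instruction-conditioned selection of experts first, then a soft per-token fusion of only the selected experts' features.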