From CLIP to DINO: Visual Encoders Shout in Multi-modal Large Language Models

Multi-modal Large Language Models (MLLMs) have made significant strides in expanding the capabilities of Large Language Models (LLMs) by incorporating visual perception interfaces. Despite the emergence of exciting applications and the availability of diverse instruction-tuning data, existing approaches typically rely on CLIP or its variants as the visual branch and extract features only from the deep layers. A comprehensive analysis of visual encoders in MLLMs is still lacking. In this paper, we conduct an extensive investigation into the effectiveness of different vision encoders within MLLMs. Our findings reveal that the shallow-layer features of CLIP offer particular advantages for fine-grained tasks such as grounding and region understanding. Surprisingly, the vision-only model DINO, which is not pretrained with text-image alignment, demonstrates promising performance as a visual branch within MLLMs. Simply equipped with an MLP layer for alignment, DINO surpasses CLIP on fine-grained perception tasks. Building upon these observations, we propose a simple yet effective feature merging strategy, named COMM, that integrates CLIP and DINO with Multi-level features Merging to enhance the visual capabilities of MLLMs. We evaluate COMM through comprehensive experiments on a wide range of benchmarks, including image captioning, visual question answering, visual grounding, and object hallucination. Experimental results demonstrate the superior performance of COMM compared to existing methods, showcasing its enhanced visual capabilities within MLLMs. Code will be made available at https://github.com/YuchenLiu98/COMM.
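
To make the merging idea concrete, the following is a minimal, dependency-free sketch of one way multi-level features from two encoders could be combined. It is not the released COMM implementation, and the abstract does not specify these details. Everything here is an assumption for illustration: the function names (merge_layers, linear, fuse), the softmax-weighted sum over layers, the per-token concatenation of the two branches, and the single linear layer standing in for the MLP that aligns DINO features. Plain Python lists stand in for tensors, and random numbers stand in for real encoder outputs.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def merge_layers(layer_feats, layer_logits):
    """Weighted sum over layers (assumed merging rule).

    layer_feats: list over layers of [tokens][dim] features.
    layer_logits: one scalar per layer, normalized with softmax.
    """
    weights = softmax(layer_logits)
    n_tokens, dim = len(layer_feats[0]), len(layer_feats[0][0])
    merged = [[0.0] * dim for _ in range(n_tokens)]
    for w, feats in zip(weights, layer_feats):
        for t in range(n_tokens):
            for d in range(dim):
                merged[t][d] += w * feats[t][d]
    return merged


def linear(tokens, weight, bias):
    """Apply y = W x + b to every token; weight is [out_dim][in_dim]."""
    return [
        [sum(w * x for w, x in zip(row, tok)) + b for row, b in zip(weight, bias)]
        for tok in tokens
    ]


def fuse(clip_layers, dino_layers, clip_logits, dino_logits, dino_proj, out_proj):
    """Merge each encoder's layers, align DINO, concatenate, and project."""
    clip_tokens = merge_layers(clip_layers, clip_logits)
    dino_tokens = merge_layers(dino_layers, dino_logits)
    # Stand-in for the alignment MLP: one linear layer (assumption).
    dino_aligned = linear(dino_tokens, *dino_proj)
    # Per-token concatenation of the two branches (assumption).
    fused = [c + d for c, d in zip(clip_tokens, dino_aligned)]
    return linear(fused, *out_proj)


def rand_matrix(rows, cols, rng):
    """Random [rows][cols] matrix used as placeholder data or weights."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


# Toy sizes, chosen arbitrarily for the demo.
rng = random.Random(0)
n_tokens, clip_dim, dino_dim, llm_dim = 4, 3, 2, 5
clip_layers = [rand_matrix(n_tokens, clip_dim, rng) for _ in range(3)]
dino_layers = [rand_matrix(n_tokens, dino_dim, rng) for _ in range(2)]
dino_proj = (rand_matrix(clip_dim, dino_dim, rng), [0.0] * clip_dim)
out_proj = (rand_matrix(llm_dim, 2 * clip_dim, rng), [0.0] * llm_dim)

visual_tokens = fuse(
    clip_layers, dino_layers, [0.0, 0.0, 0.0], [0.0, 0.0], dino_proj, out_proj
)
print(len(visual_tokens), len(visual_tokens[0]))
```

With all layer logits set to zero, the softmax weights are uniform, so merge_layers reduces to a plain average over layers. In a trained model these logits and the projection weights would be learned parameters rather than fixed or random values.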
