From CLIP to DINO: Visual Encoders Shout in Multi-modal Large Language Models

Multi-modal Large Language Models (MLLMs) have made significant strides in expanding the capabilities of Large Language Models (LLMs) by incorporating visual perception interfaces. Despite the emergence of exciting applications and the availability of diverse instruction-tuning data, existing approaches often rely on CLIP or its variants as the visual branch and extract features only from the deep layers. However, a comprehensive analysis of the visual encoders in MLLMs is still lacking. In this paper, we conduct an extensive investigation into the effectiveness of different vision encoders within MLLMs. Our findings reveal that the shallow-layer features of CLIP offer particular advantages for fine-grained tasks such as grounding and region understanding. Surprisingly, the vision-only model DINO, which is not pretrained with text-image alignment, demonstrates promising performance as a visual branch within MLLMs. Simply equipped with an MLP layer for alignment, DINO surpasses CLIP in fine-grained perception tasks. Building upon these observations, we propose a simple yet effective feature merging strategy, named COMM, that integrates CLIP and DINO with Multi-level features Merging to enhance the visual capabilities of MLLMs. We evaluate COMM through comprehensive experiments on a wide range of benchmarks, including image captioning, visual question answering, visual grounding, and object hallucination. Experimental results demonstrate the superior performance of COMM compared to existing methods, showcasing its enhanced visual capabilities within MLLMs. Code will be made available at https://github.com/YuchenLiu98/COMM.
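
The merging idea described above can be illustrated with a minimal NumPy sketch. It is not the released COMM implementation: the abstract only states that multi-level CLIP and DINO features are merged and that DINO is aligned with an MLP layer. Everything else here is an assumption made for illustration, including the softmax-weighted sum over layers, the ReLU activation, the channel-wise concatenation, the final linear projection, and all names and shapes (`merge_layers`, `fuse`, `params`, the toy dimensions).

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D vector.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def merge_layers(layer_feats, layer_logits):
    """Merge per-layer features with a weighted sum (assumed merging rule).

    layer_feats:  array of shape (num_layers, num_tokens, dim)
    layer_logits: array of shape (num_layers,), one score per layer
    """
    weights = softmax(layer_logits)  # weights sum to 1 across layers
    return np.tensordot(weights, layer_feats, axes=1)  # (num_tokens, dim)

def mlp(x, w1, b1, w2, b2):
    # Two-layer MLP with ReLU (assumed form of the alignment MLP).
    hidden = np.maximum(x @ w1 + b1, 0.0)
    return hidden @ w2 + b2

def fuse(clip_feats, dino_feats, params):
    """Hypothetical fusion: merge levels, align DINO, concatenate, project."""
    clip_merged = merge_layers(clip_feats, params["clip_logits"])
    dino_merged = merge_layers(dino_feats, params["dino_logits"])
    dino_aligned = mlp(dino_merged, params["w1"], params["b1"],
                       params["w2"], params["b2"])
    fused = np.concatenate([clip_merged, dino_aligned], axis=-1)
    return fused @ params["w_out"]  # visual tokens in the LLM embedding space

# Toy shapes, chosen only for the demo: 4 layers, 16 tokens per image.
rng = np.random.default_rng(0)
num_layers, num_tokens = 4, 16
clip_dim, dino_dim, llm_dim = 8, 6, 10

# Random stand-ins for the per-layer outputs of the two encoders.
clip_feats = rng.normal(size=(num_layers, num_tokens, clip_dim))
dino_feats = rng.normal(size=(num_layers, num_tokens, dino_dim))

params = {
    "clip_logits": np.zeros(num_layers),  # zeros give uniform layer weights
    "dino_logits": np.zeros(num_layers),
    "w1": rng.normal(size=(dino_dim, clip_dim)),
    "b1": np.zeros(clip_dim),
    "w2": rng.normal(size=(clip_dim, clip_dim)),
    "b2": np.zeros(clip_dim),
    "w_out": rng.normal(size=(2 * clip_dim, llm_dim)),
}

tokens = fuse(clip_feats, dino_feats, params)
print(tokens.shape)
```

In a trained model the layer scores and projection weights would be learned. They are initialized by hand here only so the sketch runs on its own.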
