CoME-VL: Scaling Complementary Multi-Encoder Vision-Language Learning

Recent vision-language models (VLMs) typically rely on a single vision encoder trained with contrastive image-text objectives, such as CLIP-style pretraining. While contrastive encoders are effective for cross-modal alignment and retrieval, self-supervised visual encoders often capture richer dense semantics and exhibit stronger robustness on recognition and understanding tasks. In this work, we investigate how to scale the fusion of these complementary visual representations for vision-language modeling. We propose CoME-VL (Complementary Multi-Encoder Vision-Language), a modular fusion framework that integrates a contrastively trained vision encoder with a self-supervised DINO encoder. Our approach performs representation-level fusion through (i) entropy-guided multi-layer aggregation with orthogonality-constrained projections to reduce redundancy, and (ii) RoPE-enhanced cross-attention that aligns heterogeneous token grids and produces compact fused visual tokens. The fused tokens can be injected into a decoder-only LLM with minimal changes to standard VLM pipelines. Extensive experiments across diverse vision-language benchmarks demonstrate that CoME-VL consistently outperforms single-encoder baselines. In particular, we observe an average improvement of 4.9% on visual understanding tasks and 5.4% on grounding tasks. On RefCOCO detection, our method achieves state-of-the-art performance and improves over the baseline by a large margin. Finally, we conduct ablation studies on layer merging, non-redundant feature mixing, and fusion capacity to evaluate how complementary contrastive and self-supervised signals affect VLM performance.
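
To make step (i) concrete, the following is a minimal illustrative sketch, not the authors' implementation. The abstract does not specify how entropy guides the aggregation or how the orthogonality constraint is imposed, so everything here is an assumption: each layer is summarized by a discrete distribution, lower-entropy layers receive larger weights through a softmax over negative entropies, and orthogonality is measured as the squared Frobenius norm of W W^T - I. All function names, the `temperature` parameter, and the toy inputs are hypothetical.

```python
import math


def shannon_entropy(probs):
    """Shannon entropy (in nats) of a discrete probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def entropy_guided_weights(layer_distributions, temperature=1.0):
    """ASSUMPTION: layers with lower entropy receive larger weight.

    The paper may use a different rule; this is only one plausible choice.
    """
    entropies = [shannon_entropy(d) for d in layer_distributions]
    return softmax([-h / temperature for h in entropies])


def aggregate_layers(layer_features, weights):
    """Weighted sum of per-layer feature vectors of equal length."""
    dim = len(layer_features[0])
    return [
        sum(w * feats[i] for w, feats in zip(weights, layer_features))
        for i in range(dim)
    ]


def orthogonality_penalty(rows):
    """Squared Frobenius norm of (W W^T - I), with W given as a list of rows.

    ASSUMPTION: a soft penalty of this form stands in for the
    orthogonality constraint mentioned in the abstract.
    """
    penalty = 0.0
    for i, a in enumerate(rows):
        for j, b in enumerate(rows):
            dot = sum(x * y for x, y in zip(a, b))
            target = 1.0 if i == j else 0.0
            penalty += (dot - target) ** 2
    return penalty


# Toy inputs (hypothetical): two layers, each with a 4-bin distribution
# and a 2-dimensional feature vector.
layer_dists = [[0.25, 0.25, 0.25, 0.25], [0.7, 0.1, 0.1, 0.1]]
layer_feats = [[1.0, 0.0], [0.0, 1.0]]

weights = entropy_guided_weights(layer_dists)
fused = aggregate_layers(layer_feats, weights)
```

In this toy setup the second layer has the more peaked distribution, so it receives the larger weight. A projection whose rows are already orthonormal incurs zero penalty, while redundant (repeated) rows are penalized.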
