Recent multimodal large language models (MLLMs) increasingly integrate multiple vision encoders to improve performance on various benchmarks, assuming that diverse pretraining objectives yield complementary visual signals. However, we show this assumption often fails in practice. Through systematic encoder masking across representative multi-encoder MLLMs, we find that performance typically degrades gracefully, and sometimes even improves, when selected encoders are masked, revealing pervasive encoder redundancy. To quantify this effect, we introduce two principled metrics: the Conditional Utilization Rate (CUR), which measures an encoder's marginal contribution in the presence of others, and the Information Gap (IG), which captures the heterogeneity of encoder utility within a model. Using these tools, we observe (i) strong specialization on tasks such as OCR and chart understanding, where a single encoder can dominate with a CUR greater than 90%; (ii) high redundancy on general VQA and knowledge-based tasks, where encoders are largely interchangeable; and (iii) instances of detrimental encoders with negative CUR. Notably, masking specific encoders can yield up to a 16% accuracy gain on a specific task category and a 3.6% overall performance boost over the full model. Furthermore, single- and dual-encoder variants recover over 90% of baseline performance on most non-OCR tasks. Our analysis challenges the "more encoders are better" heuristic in MLLMs and provides actionable diagnostics for developing more efficient and effective multimodal architectures.
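As a minimal illustrative sketch (not the paper's exact definitions, which are given later in the text), one plausible reading of the two metrics is: CUR as the relative accuracy drop when a single encoder is masked while the others remain, and IG as the spread of CUR values across a model's encoders. The encoder names and accuracy numbers below are hypothetical.

```python
# Hedged sketch of the masking-based diagnostics described in the abstract.
# Assumptions (not from the source): CUR_i = (acc_full - acc_masked_i) / acc_full,
# and IG = max_i CUR_i - min_i CUR_i. All values below are made up for illustration.

def conditional_utilization_rate(acc_full: float, acc_masked_i: float) -> float:
    """Marginal contribution of encoder i in the presence of the others:
    the relative accuracy drop when only encoder i is masked (assumed form)."""
    return (acc_full - acc_masked_i) / acc_full

def information_gap(curs) -> float:
    """Heterogeneity of encoder utility within a model: the gap between the
    most and least useful encoders' CUR (assumed form)."""
    curs = list(curs)
    return max(curs) - min(curs)

# Hypothetical OCR-style task: one encoder dominates, one is detrimental.
acc_full = 0.80
acc_masked = {
    "enc_A": 0.08,  # masking A is catastrophic -> A carries most of the signal
    "enc_B": 0.79,  # masking B barely matters -> B is redundant here
    "enc_C": 0.82,  # masking C *improves* accuracy -> negative CUR (detrimental)
}

curs = {name: conditional_utilization_rate(acc_full, a)
        for name, a in acc_masked.items()}
print(curs)                            # enc_A has CUR 0.9, enc_C is negative
print(information_gap(curs.values()))  # large IG signals strong specialization
```

Under these assumed definitions, a task where one encoder has CUR near 1 and the rest near (or below) 0 yields a large IG, matching the "strong specialization" regime, while uniformly small CURs and a small IG correspond to the "high redundancy" regime.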