Recent multimodal large language models (MLLMs) increasingly integrate multiple vision encoders to improve performance on various benchmarks, assuming that diverse pretraining objectives yield complementary visual signals. However, we show that this assumption often fails in practice. Through systematic encoder masking across representative multi-encoder MLLMs, we find that performance typically degrades gracefully, and sometimes even improves, when selected encoders are masked, revealing pervasive encoder redundancy. To quantify this effect, we introduce two principled metrics: the Conditional Utilization Rate (CUR), which measures an encoder's marginal contribution in the presence of others, and the Information Gap (IG), which captures the heterogeneity of encoder utility within a model. Using these tools, we observe: (i) strong specialization on tasks such as OCR and Chart, where a single encoder can dominate with a CUR above 90%; (ii) high redundancy on general VQA and knowledge-based tasks, where encoders are largely interchangeable; and (iii) instances of detrimental encoders with negative CUR. Notably, masking specific encoders can yield up to 16% higher accuracy on a specific task category and a 3.6% overall performance boost compared to the full model. Furthermore, single- and dual-encoder variants recover over 90% of baseline performance on most non-OCR tasks with substantially lower training resources and inference latency. Our analysis challenges the "more encoders are better" heuristic in MLLMs and provides actionable diagnostics for developing more efficient and effective multimodal architectures.
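For concreteness, below is a minimal sketch of how such diagnostics could be computed from masking experiments. The formulas used here (relative accuracy drop for CUR, max-minus-min CUR spread for IG) and all names in the snippet are illustrative assumptions, not the paper's verbatim definitions.

```python
# Hedged sketch, assuming CUR is the relative accuracy drop when one
# encoder is masked and IG is the spread of CUR values within a model.
# `full_acc` is accuracy with all encoders active; `masked_accs[i]` is
# accuracy with encoder i masked. Function names are hypothetical.

def conditional_utilization_rate(full_acc: float, masked_acc: float) -> float:
    """Relative performance drop when one encoder is masked.

    Negative values indicate a detrimental encoder: masking it helps.
    """
    return (full_acc - masked_acc) / full_acc

def information_gap(full_acc: float, masked_accs: list[float]) -> float:
    """Spread between the most and least useful encoders in the model."""
    curs = [conditional_utilization_rate(full_acc, m) for m in masked_accs]
    return max(curs) - min(curs)

# Toy example: an OCR-style task where encoder 0 dominates.
full = 0.80
masked = [0.05, 0.78, 0.81]  # masking encoder 2 slightly improves accuracy
print([round(conditional_utilization_rate(full, m), 3) for m in masked])
print(round(information_gap(full, masked), 3))
```

On this toy example, encoder 0's CUR is about 0.94 (it dominates, as in the OCR specialization case), encoder 2's CUR is slightly negative (a detrimental encoder), and the IG is about 0.95, reflecting highly heterogeneous encoder utility.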