The ability to accurately interpret complex visual information is a crucial capability of multimodal large language models (MLLMs). Recent work indicates that enhanced visual perception significantly reduces hallucinations and improves performance on resolution-sensitive tasks, such as optical character recognition and document analysis. A number of recent MLLMs achieve this goal using a mixture of vision encoders. Despite their success, there is a lack of systematic comparisons and detailed ablation studies addressing critical aspects, such as expert selection and the integration of multiple vision experts. This study provides an extensive exploration of the design space for MLLMs using a mixture of vision encoders and resolutions. Our findings reveal several underlying principles common to various existing strategies, leading to a streamlined yet effective design approach. We discover that simply concatenating visual tokens from a set of complementary vision encoders is as effective as more complex mixing architectures or strategies. We additionally introduce Pre-Alignment to bridge the gap between vision-focused encoders and language tokens, enhancing model coherence. The resulting family of MLLMs, Eagle, surpasses other leading open-source models on major MLLM benchmarks. Models and code: https://github.com/NVlabs/Eagle
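The channel-wise concatenation of visual tokens described above can be sketched as follows. This is a minimal illustrative example, not Eagle's actual implementation: the encoder names, feature widths, and projection are hypothetical, and it assumes the two encoders' token grids have already been spatially aligned to the same number of tokens.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions (not Eagle's actual configuration).
num_tokens = 16                      # e.g. a 4x4 grid of aligned patch tokens
dim_enc_a, dim_enc_b = 1024, 768     # per-encoder feature widths
dim_llm = 2048                       # LLM embedding width

# Stand-ins for the outputs of two complementary vision encoders
# (e.g. a semantic encoder and a high-resolution detail encoder).
tokens_a = rng.standard_normal((num_tokens, dim_enc_a))
tokens_b = rng.standard_normal((num_tokens, dim_enc_b))

# Concatenate the aligned token grids along the channel dimension,
# then project the fused tokens into the LLM embedding space.
fused = np.concatenate([tokens_a, tokens_b], axis=-1)
projection = rng.standard_normal((dim_enc_a + dim_enc_b, dim_llm)) * 0.01
visual_tokens = fused @ projection

print(fused.shape)           # (16, 1792)
print(visual_tokens.shape)   # (16, 2048)
```

The appeal of this design is its simplicity: no routing, gating, or cross-attention mixing module is needed; the only learned fusion component is the final projection into the language model's token space.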