The ability to accurately interpret complex visual information is a crucial capability of multimodal large language models (MLLMs). Recent work indicates that enhanced visual perception significantly reduces hallucinations and improves performance on resolution-sensitive tasks, such as optical character recognition and document analysis. A number of recent MLLMs achieve this goal using a mixture of vision encoders. Despite their success, there is a lack of systematic comparisons and detailed ablation studies addressing critical aspects, such as expert selection and the integration of multiple vision experts. This study provides an extensive exploration of the design space for MLLMs using a mixture of vision encoders and resolutions. Our findings reveal several underlying principles common to various existing strategies, leading to a streamlined yet effective design approach. We discover that simply concatenating visual tokens from a set of complementary vision encoders is as effective as more complex mixing architectures or strategies. We additionally introduce Pre-Alignment to bridge the gap between vision-focused encoders and language tokens, enhancing model coherence. The resulting family of MLLMs, Eagle, surpasses other leading open-source models on major MLLM benchmarks.
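To make the concatenation finding concrete, the following is a minimal sketch of channel-wise fusion of visual tokens from two complementary encoders. All names here (`fake_encoder`, the CLIP/ConvNeXt labels, token counts, and dimensions) are illustrative assumptions, not the paper's actual implementation; the sketch only assumes that each encoder's output has been brought to a common number of tokens (e.g. by interpolating to a shared spatial resolution) before fusion.

```python
def fake_encoder(num_tokens, dim, fill):
    """Stand-in for a vision encoder: emits num_tokens feature
    vectors of size dim (values are dummies)."""
    return [[fill] * dim for _ in range(num_tokens)]

def concat_visual_tokens(token_seqs):
    """Fuse per-token features channel-wise. Assumes every encoder
    produces the same number of tokens; each fused token is the
    concatenation of the corresponding tokens from all encoders."""
    num_tokens = len(token_seqs[0])
    assert all(len(seq) == num_tokens for seq in token_seqs)
    return [
        sum((seq[i] for seq in token_seqs), [])  # join channel dims
        for i in range(num_tokens)
    ]

# Hypothetical outputs: e.g. a CLIP-style and a ConvNeXt-style encoder.
clip_tokens = fake_encoder(num_tokens=4, dim=3, fill=0.1)
convnext_tokens = fake_encoder(num_tokens=4, dim=5, fill=0.2)

fused = concat_visual_tokens([clip_tokens, convnext_tokens])
print(len(fused), len(fused[0]))  # 4 tokens, each with 3 + 5 = 8 channels
```

The fused token sequence can then be projected into the language model's embedding space as usual; the abstract's point is that this simple fusion matches more elaborate mixing architectures.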