In recent years, multimodal large language models (MLLMs) have made remarkable progress, driven largely by effective paradigms for integrating visual and textual information. The dominant connector-based paradigm projects visual features into the textual token sequence, enabling unified multimodal alignment and reasoning within a single generative architecture. However, our experiments reveal two key limitations: (1) although visual information is the core evidential modality in MLLMs, visual tokens are treated on par with textual tokens, which dilutes the unique contribution of the visual modality; (2) as generation length increases, particularly within a limited context window, the model's dependence on visual information progressively weakens, which degrades vision-language alignment and reduces the consistency between generated content and visual semantics. To address these challenges, we propose the Vision Inference Former (VIF), a lightweight architectural module that establishes a direct bridge between pure visual representations and the model's output space. Specifically, VIF continuously injects visual semantics throughout the decoding phase of inference, keeping the model grounded in visual content during generation. We conduct experiments on 14 benchmark tasks covering general reasoning, OCR, table understanding, vision-centric evaluation, and hallucination. The results show that VIF consistently improves performance across diverse model architectures while introducing minimal additional overhead. The code for this work is available at https://github.com/Dong-Xinpeng/VIF.
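To make the idea of injecting visual semantics at every decoding step concrete, the following is a minimal toy sketch and not the VIF implementation. Everything in it is an assumption made for illustration: the gated cross-attention form, the names `inject_visual` and `greedy_decode`, the random weights, and the `tanh` update standing in for a real decoder step. It only shows the general pattern of letting the hidden state attend to raw visual features immediately before the output projection, at each step of generation.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(vec, mat):
    """Row vector (length r) times matrix (r x c), returning a list of length c."""
    return [sum(v * row[j] for v, row in zip(vec, mat)) for j in range(len(mat[0]))]


def inject_visual(hidden, visual, w_q, w_k, w_v, gate):
    """Hypothetical injection step: the hidden state attends over visual
    features and the attended context is added back as a gated residual."""
    q = matvec(hidden, w_q)
    keys = [matvec(patch, w_k) for patch in visual]
    values = [matvec(patch, w_v) for patch in visual]
    scale = math.sqrt(len(q))
    weights = softmax([sum(a * b for a, b in zip(q, k)) / scale for k in keys])
    context = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(len(hidden))]
    return [h + gate * c for h, c in zip(hidden, context)]


def rand_matrix(rng, rows, cols):
    return [[rng.gauss(0, 1) for _ in range(cols)] for _ in range(rows)]


# Toy dimensions and random weights (all assumed, for illustration only).
rng = random.Random(0)
D, D_V, D_A, N_PATCHES, VOCAB = 8, 12, 4, 5, 20
visual = rand_matrix(rng, N_PATCHES, D_V)   # stand-in for vision-encoder features
w_q = rand_matrix(rng, D, D_A)
w_k = rand_matrix(rng, D_V, D_A)
w_v = rand_matrix(rng, D_V, D)
embed = rand_matrix(rng, VOCAB, D)
w_out = rand_matrix(rng, D, VOCAB)          # stand-in for the LM output head


def greedy_decode(steps, gate):
    """Greedy decoding where visual context is re-injected at every step."""
    token, hidden, tokens = 0, [0.0] * D, []
    for _ in range(steps):
        # Stand-in for one decoder step of a real language model.
        hidden = [math.tanh(h + e) for h, e in zip(hidden, embed[token])]
        # Visual features are consulted again before the output projection.
        grounded = inject_visual(hidden, visual, w_q, w_k, w_v, gate)
        logits = matvec(grounded, w_out)
        token = max(range(VOCAB), key=logits.__getitem__)
        tokens.append(token)
    return tokens
```

In this sketch, calling `greedy_decode(6, gate=0.5)` applies the injection at all six steps, while `gate=0.0` reduces it to the toy decoder without any visual injection, which makes the two settings easy to compare.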