Vision Inference Former: Sustaining Visual Consistency in Multimodal Large Language Models

In recent years, multimodal large language models (MLLMs) have achieved remarkable progress, largely owing to effective paradigms for integrating visual and textual information. The dominant connector-based paradigm projects visual features into the textual sequence, enabling unified multimodal alignment and reasoning within a generative architecture. However, our experiments reveal two key limitations: (1) although visual information serves as the core evidential modality in MLLMs, it is treated on par with textual tokens, which diminishes the unique contribution of the visual modality; (2) as generation length increases, particularly within a limited context window, the model's dependence on visual information progressively weakens, degrading vision-language alignment and reducing the consistency between generated content and visual semantics. To address these challenges, we propose the Vision Inference Former (VIF), a lightweight architectural module that establishes a direct bridge between pure visual representations and the model's output space. Specifically, VIF continuously injects visual semantics throughout the decoding phase of inference, so that the model remains grounded in visual content during generation. We conduct experiments on 14 benchmark tasks covering general reasoning, OCR, table understanding, vision-centric evaluation, and hallucination. Experimental results show that VIF consistently improves model performance across diverse architectures while introducing minimal additional overhead. The code for this work is available at https://github.com/Dong-Xinpeng/VIF.
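The abstract does not specify VIF's internals, so the following is only a toy sketch of the general idea of injecting a visual signal at every decoding step. It is not the authors' implementation. All names (`decode_with_visual_injection`, `w_vis`, `alpha`), the mean pooling, the linear projection, and the additive fusion are assumptions made for illustration. To stay self-contained, it uses precomputed per-step language-model logits in place of a real autoregressive model.

```python
import numpy as np


def decode_with_visual_injection(lm_logits_per_step, visual_feats, w_vis, alpha=0.5):
    """Greedy decoding with a visual bias added to the logits at every step.

    Hypothetical toy, not the VIF architecture:
      lm_logits_per_step: (steps, vocab) logits from a language model
      visual_feats:       (patches, dim) features from a vision encoder
      w_vis:              (dim, vocab) projection into vocabulary space
      alpha:              strength of the visual contribution
    """
    # Assumption: summarize the image by mean-pooling its patch features.
    pooled = visual_feats.mean(axis=0)
    # Assumption: map the pooled vector directly to the output (vocabulary) space.
    visual_logits = pooled @ w_vis

    tokens = []
    for step_logits in lm_logits_per_step:
        # The same visual term is added at each step, so its influence
        # does not shrink as the generated sequence grows.
        fused = step_logits + alpha * visual_logits
        tokens.append(int(np.argmax(fused)))
    return tokens


# Tiny deterministic example: 3 decoding steps, vocabulary of 4 tokens.
lm_logits = np.zeros((3, 4))
visual_feats = np.array([[1.0, 0.0], [1.0, 0.0]])
w_vis = np.array([[0.0, 0.0, 2.0, 0.0],
                  [0.0, 0.0, 0.0, 0.0]])

without_vision = decode_with_visual_injection(lm_logits, visual_feats, w_vis, alpha=0.0)
with_vision = decode_with_visual_injection(lm_logits, visual_feats, w_vis, alpha=0.5)
```

With `alpha=0.0` the visual term vanishes and decoding depends on the language-model logits alone. With a positive `alpha`, every step is pulled toward the token favored by the image features, regardless of how many tokens have already been generated. How the actual module computes and fuses its visual signal is described in the paper and the linked repository, not here.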

