The visual projector, which bridges the vision and language modalities and facilitates cross-modal alignment, is a crucial component of MLLMs. However, measuring how effectively projectors align vision and language remains under-explored; at present, it can only be inferred from the performance of MLLMs on downstream tasks. Motivated by this problem, this study examines the projector module by interpreting the vision-language semantic flow within MLLMs. Specifically, we trace the semantic relevance flow back from generated language tokens to the raw visual encoder patches and to the intermediate outputs produced by projectors. Our findings reveal that compressive projectors (e.g., QFormer) abstract visual patches into a limited set of semantic concepts, such as objects or attributes, resulting in a 'double abstraction' phenomenon. This involves a first visual semantic abstraction by the projector, guided by pre-defined query tokens, and a second extraction by the LLM, guided by text instructions. Double abstraction is inefficient in training and leads to a cumulative deficiency in visual semantics. To mitigate this issue, we propose the key insight of 'Decouple Compression from Abstraction' (DeCo): the projector compresses the number of visual tokens at the patch level, and the LLM handles visual semantic abstraction entirely. Accordingly, we adopt a simple compressor, 2D Adaptive Pooling, to downsample visual patches in a parameter-free manner. Empirical evaluation demonstrates that DeCo surpasses traditional compressive projectors in both performance and efficiency. It achieves performance gains of 0.9%, 7.1%, and 2.9% on MLLM Benchmarks, Visual Localization, and Open-ended VQA tasks, respectively, with fewer trainable parameters and faster convergence.
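The patch-level, parameter-free compression described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes average pooling (the abstract says only "2D Adaptive Pooling"), a square patch grid stored in row-major order, and plain Python lists in place of tensors. The names `adaptive_avg_pool_2d` and `compress_patch_tokens` are hypothetical.

```python
from typing import List

Vector = List[float]
Grid = List[List[Vector]]  # grid[row][col] holds one patch feature vector


def adaptive_avg_pool_2d(grid: Grid, out_h: int, out_w: int) -> Grid:
    """Average-pool a patch grid down to out_h x out_w with no learned weights.

    Each output cell averages the input window
    [floor(i * in / out), ceil((i + 1) * in / out)) along each axis,
    so the input size does not have to be a multiple of the output size.
    """
    in_h, in_w = len(grid), len(grid[0])
    dim = len(grid[0][0])
    pooled: Grid = []
    for i in range(out_h):
        r0 = (i * in_h) // out_h                      # floor
        r1 = ((i + 1) * in_h + out_h - 1) // out_h    # ceil
        row: List[Vector] = []
        for j in range(out_w):
            c0 = (j * in_w) // out_w
            c1 = ((j + 1) * in_w + out_w - 1) // out_w
            window = [grid[r][c] for r in range(r0, r1) for c in range(c0, c1)]
            row.append([sum(v[d] for v in window) / len(window) for d in range(dim)])
        pooled.append(row)
    return pooled


def compress_patch_tokens(tokens: List[Vector], side: int, out_side: int) -> List[Vector]:
    """Reshape a flat token sequence into a side x side grid, pool it, and flatten again."""
    grid = [tokens[r * side:(r + 1) * side] for r in range(side)]
    pooled = adaptive_avg_pool_2d(grid, out_side, out_side)
    return [vec for row in pooled for vec in row]


if __name__ == "__main__":
    # Toy input: a 4 x 4 grid of 1-dimensional "patch features" with values 0..15.
    tokens = [[float(i)] for i in range(16)]
    print(compress_patch_tokens(tokens, side=4, out_side=2))
    # prints [[2.5], [4.5], [10.5], [12.5]]
```

In the toy run, 16 patch tokens are reduced to 4, and each output token is just the mean of a spatial neighborhood of patches. The compressor therefore has no query tokens and no trainable parameters, so any semantic abstraction over the pooled tokens is left to the LLM, which is the decoupling the abstract argues for.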