Most multimodal large language models (MLLMs) treat visual tokens as "a sequence of text", integrating them with text tokens into a large language model (LLM). However, the large number of visual tokens substantially increases the demand for computational resources and time. In this paper, we propose InternVL-X, which outperforms the InternVL model in both performance and efficiency by incorporating three visual token compression methods. First, we propose a novel vision-language projector, PVTC. This component merges adjacent visual embeddings to form local queries and uses the transformed CLS token as a global query, then performs point-to-region cross-attention through these local and global queries to convert visual features more effectively. Second, we present a layer-wise visual token compression module, LVTC, which compresses tokens in the shallow layers of the LLM and expands them through upsampling and residual connections in the deeper layers, significantly improving the model's computational efficiency. Furthermore, we propose an efficient high-resolution slicing method, RVTC, which dynamically adjusts the number of visual tokens based on image area or length filtering. RVTC greatly improves training efficiency at only a slight cost in performance. Using 20% or fewer visual tokens, InternVL-X achieves state-of-the-art performance on 7 public MLLM benchmarks and improves the average metric across 12 tasks by 2.34%.
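As a rough illustration of the PVTC idea described above, the following PyTorch sketch pools adjacent patch embeddings into local queries, projects the CLS token into a global query, and lets both cross-attend over the patch features to produce a compressed set of visual tokens. All module names, dimensions, and the pooling/attention details (including attending over the full patch grid rather than restricted regions) are our own assumptions for illustration, not the paper's released implementation.

```python
# Minimal, hypothetical sketch of a PVTC-style projector, based only on the
# abstract: adjacent visual embeddings are pooled into local queries, the CLS
# token becomes a global query, and both cross-attend to the visual features.
import torch
import torch.nn as nn


class PVTCSketch(nn.Module):
    def __init__(self, vis_dim=1024, llm_dim=4096, window=2, num_heads=8):
        super().__init__()
        self.window = window  # each local query summarizes a window x window patch block
        self.local_proj = nn.Linear(vis_dim, llm_dim)   # local queries from pooled neighbors
        self.global_proj = nn.Linear(vis_dim, llm_dim)  # global query from the CLS token
        self.kv_proj = nn.Linear(vis_dim, llm_dim)      # keys/values from all patch tokens
        self.cross_attn = nn.MultiheadAttention(llm_dim, num_heads, batch_first=True)

    def forward(self, patch_tokens, cls_token):
        # patch_tokens: (B, H*W, vis_dim) grid features; cls_token: (B, vis_dim)
        b, n, c = patch_tokens.shape
        h = w = int(n ** 0.5)
        grid = patch_tokens.transpose(1, 2).reshape(b, c, h, w)
        # Merge adjacent embeddings (average pooling here) into compressed local queries.
        pooled = nn.functional.avg_pool2d(grid, self.window).flatten(2).transpose(1, 2)
        local_q = self.local_proj(pooled)                    # (B, N / window^2, llm_dim)
        global_q = self.global_proj(cls_token).unsqueeze(1)  # (B, 1, llm_dim)
        queries = torch.cat([local_q, global_q], dim=1)
        kv = self.kv_proj(patch_tokens)
        # Cross-attention: a small number of queries attend over all visual features.
        out, _ = self.cross_attn(queries, kv, kv)
        return out  # compressed visual tokens passed on to the LLM


# Example: 576 patch tokens compressed to 144 local tokens plus 1 global token.
tokens = PVTCSketch()(torch.randn(2, 576, 1024), torch.randn(2, 1024))
print(tokens.shape)  # torch.Size([2, 145, 4096])
```

Under these assumptions, a 2x2 pooling window yields roughly a 4x reduction in visual tokens before they reach the LLM; the abstract's LVTC and RVTC modules then compress further inside the LLM layers and at the image-slicing stage, respectively.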