The development of Multi-modal Large Language Models (MLLMs) enhances Large Language Models (LLMs) with the ability to perceive data formats beyond text, significantly advancing a range of downstream applications such as visual question answering and image captioning. However, the substantial computational cost of processing high-resolution images and videos poses a barrier to their broader adoption. To address this challenge, compressing vision tokens in MLLMs has emerged as a promising approach to reducing inference costs, and existing methods typically conduct token reduction in the feature alignment phase. In this paper, we introduce VisToG, a novel grouping mechanism that leverages the capabilities of pre-trained vision encoders to group similar image segments without the need for segmentation masks. Specifically, after the linear projection layer, we concatenate semantic tokens that represent image semantic segments before feeding the sequence into the vision encoder. In addition, with the isolated attention we adopt, VisToG can identify and eliminate redundant visual tokens by exploiting the prior knowledge in the pre-trained vision encoder, which effectively reduces computational demands. Extensive experiments demonstrate the effectiveness of VisToG: it maintains 98.1% of the original performance while reducing inference time by over 27%.
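To make the "isolated attention" idea concrete, below is a minimal sketch of one way such an attention mask could be constructed. The exact masking scheme is an assumption not fully specified above: here, patch tokens attend only to other patch tokens (so the pre-trained encoder's behavior on image patches is undisturbed), while the appended semantic tokens attend to all tokens and can thus aggregate similar image segments.

```python
import numpy as np

def isolated_attention_mask(num_patch: int, num_semantic: int) -> np.ndarray:
    """Boolean attention mask (True = may attend) for a sequence of
    `num_patch` patch tokens followed by `num_semantic` semantic tokens.

    Assumed scheme (illustrative, not the paper's exact formulation):
    patch tokens are isolated from semantic tokens, while semantic
    tokens attend to the full sequence to group similar patches.
    """
    n = num_patch + num_semantic
    mask = np.zeros((n, n), dtype=bool)
    mask[:num_patch, :num_patch] = True  # patches see only patches
    mask[num_patch:, :] = True           # semantic tokens see everything
    return mask

mask = isolated_attention_mask(num_patch=4, num_semantic=2)
```

Such a mask can be passed to a standard transformer attention layer so that the frozen encoder's patch representations remain unchanged while the extra tokens pool information from them.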