Large Multimodal Models (LMMs) have shown significant reasoning capabilities by connecting a visual encoder and a large language model. LMMs typically use a fixed number of visual tokens, such as the penultimate-layer features of the CLIP visual encoder, as the prefix content. Recent LMMs incorporate more complex visual inputs, such as high-resolution images and videos, which increase the number of visual tokens significantly. However, due to the design of the Transformer architecture, the computational cost of these models grows quadratically with the number of input tokens. To tackle this problem, we explore a token reduction mechanism and find, consistent with prior work, that many visual tokens are spatially redundant. Based on this, we propose PruMerge, a novel adaptive visual token reduction approach that largely reduces the number of visual tokens while maintaining comparable model performance. We first select the unpruned visual tokens based on their similarity to class tokens and spatial tokens. We then cluster the pruned tokens based on key similarity and merge the clustered tokens with the unpruned tokens to supplement their information. Empirically, when applied to LLaVA-1.5, our approach can compress the visual tokens by 18 times on average, and achieve comparable performance across diverse visual question-answering and reasoning tasks. Code and checkpoints are at https://llava-prumerge.github.io/.
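The pipeline described above can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: it assumes attention-based outlier selection for the unpruned tokens (using an interquartile-range rule as one plausible adaptive criterion), nearest-neighbor assignment of pruned tokens to kept tokens by key similarity, and attention-weighted merging; the function name, the IQR rule, and the fallback ratio are all illustrative assumptions.

```python
import numpy as np

def prune_and_merge(tokens, attn, keys):
    """Sketch of adaptive token reduction (assumed details, not the paper's exact method).

    tokens: (N, D) visual token features
    attn:   (N,)  class-token attention score for each visual token
    keys:   (N, D) key vectors used to measure token similarity
    Returns (merged_tokens, kept_indices).
    """
    # Adaptively select "important" tokens as attention outliers
    # (IQR rule here is an assumption about the selection criterion).
    q1, q3 = np.percentile(attn, [25, 75])
    keep = attn > q3 + 1.5 * (q3 - q1)
    if not keep.any():
        # Fallback: keep the top N/18 tokens, echoing the ~18x average compression.
        k = max(1, len(attn) // 18)
        keep = attn >= np.sort(attn)[-k]

    kept_idx = np.where(keep)[0]
    pruned_idx = np.where(~keep)[0]
    merged = tokens[kept_idx].copy()

    if len(pruned_idx) > 0:
        # Cluster: assign each pruned token to its most similar kept token
        # by key dot-product similarity.
        sim = keys[pruned_idx] @ keys[kept_idx].T       # (P, K)
        nearest = sim.argmax(axis=1)                    # cluster id per pruned token
        for j, i in enumerate(kept_idx):
            members = pruned_idx[nearest == j]
            if len(members) > 0:
                # Merge: attention-weighted average of the kept token
                # and its cluster members, supplementing its information.
                group = np.concatenate([tokens[i][None, :], tokens[members]])
                w = np.concatenate([attn[i][None], attn[members]])
                merged[j] = (w[:, None] * group).sum(axis=0) / w.sum()

    return merged, kept_idx
```

In practice the kept-token count varies per image (hence "adaptive"): images with a few salient regions concentrate class-token attention on few outliers, while cluttered images keep more tokens.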