Large Multimodal Models (LMMs) have shown significant reasoning capabilities by connecting a visual encoder to a large language model. LMMs typically take a fixed number of visual tokens, such as the penultimate-layer features of the CLIP visual encoder, as prefix content. Recent LMMs incorporate more complex visual inputs, such as high-resolution images and videos, which significantly increase the number of visual tokens. However, due to the design of the Transformer architecture, the computational cost of these models tends to grow quadratically with the number of input tokens. To tackle this problem, we explore a token reduction mechanism and find, consistent with prior work, that many visual tokens are spatially redundant. Based on this observation, we propose PruMerge, a novel adaptive visual token reduction approach that substantially reduces the number of visual tokens while maintaining comparable model performance. We first select the unpruned visual tokens based on the similarity between the class token and the spatial tokens. We then cluster the pruned tokens by key similarity and merge each cluster into its corresponding unpruned token to supplement that token's information. Empirically, when applied to LLaVA-1.5, our approach compresses the visual tokens by 14.4 times on average while achieving comparable performance across diverse visual question-answering and reasoning tasks. Code and checkpoints are available at https://llava-prumerge.github.io/.
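To make the prune-then-merge pipeline concrete, the sketch below illustrates one plausible reading of the three steps described above: select tokens the class token attends to most, assign each pruned token to its most key-similar kept token, and merge each cluster into its kept token by attention-weighted averaging. This is a minimal illustration, not the authors' implementation: `prumerge_sketch` is a hypothetical function, the top-k rule stands in for the paper's adaptive selection, and the weighted-average merge is an assumption.

```python
# Hedged sketch of a PruMerge-style token reduction step (assumptions:
# top-k selection as a stand-in for adaptive selection; attention-weighted
# averaging as the merge rule). Not the authors' implementation.
import torch


def prumerge_sketch(keys, values, cls_attn, keep):
    """Reduce N visual tokens to `keep` tokens.

    keys:     (N, d) key vectors of the spatial visual tokens
    values:   (N, d) token features to be pruned or merged
    cls_attn: (N,)   attention of the class token over spatial tokens
    keep:     number of tokens retained after reduction
    """
    n = cls_attn.shape[0]

    # 1) Keep the tokens the class token attends to most
    #    (proxy for the paper's adaptive selection rule).
    kept_idx = torch.topk(cls_attn, keep).indices
    pruned_mask = torch.ones(n, dtype=torch.bool)
    pruned_mask[kept_idx] = False
    pruned_idx = torch.nonzero(pruned_mask).squeeze(-1)

    # 2) Cluster: assign each pruned token to its most
    #    key-similar kept token.
    sim = torch.nn.functional.cosine_similarity(
        keys[pruned_idx].unsqueeze(1),   # (num_pruned, 1, d)
        keys[kept_idx].unsqueeze(0),     # (1, keep, d)
        dim=-1,
    )                                    # (num_pruned, keep)
    assign = sim.argmax(dim=1)           # cluster id per pruned token

    # 3) Merge: fold each cluster into its kept token via a
    #    class-token-attention-weighted average.
    merged = values[kept_idx].clone()
    for j in range(keep):
        members = pruned_idx[assign == j]
        group = torch.cat([kept_idx[j : j + 1], members])
        w = cls_attn[group]
        merged[j] = (w.unsqueeze(-1) * values[group]).sum(0) / w.sum()
    return merged                        # (keep, d)
```

For instance, with 576 CLIP visual tokens (`keys` and `values` of shape (576, d)) and `keep=40`, the sketch returns 40 merged tokens, matching the roughly 14x compression reported for LLaVA-1.5.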