Large Multimodal Models (LMMs) have achieved significant success across a wide range of tasks. These models typically encode visual inputs into dense token sequences, which are concatenated with textual tokens and processed jointly by a language model. However, the increased token count substantially raises computational and memory costs during inference. Token pruning has emerged as a promising remedy, yet existing methods often rely on costly calibration or suboptimal importance metrics, so the retained tokens remain redundant. In this paper, we analyze the redundancy gap between visual and textual tokens and argue for pruning visual tokens exclusively. Building on this analysis, we propose a visual token pruning strategy that explicitly preserves both cross-modal alignment and intra-modal informational diversity. We first introduce a mutual-information-based criterion that removes visual tokens semantically misaligned with the textual tokens, thereby preserving the alignment between the visual and textual modalities. To further improve the representational quality of the retained tokens, we then prune redundant visual tokens by maximizing the expected pairwise distance in the embedding space, a problem we solve efficiently with a greedy algorithm. Extensive experiments demonstrate that our method maintains strong performance while reducing tokens by 88.9% on models such as LLaVA-1.5-7B and LLaVA-NEXT-7B, yielding a 56.7% improvement in inference speed.
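As a rough illustration of the two-stage idea, the sketch below greedily selects a subset of visual token embeddings that maximizes the sum of pairwise distances, after a text-alignment pre-filter. This is only an assumed reading of the abstract, not the paper's implementation: the cosine similarity to the mean text embedding stands in for the mutual-information criterion, and all names (`prune_visual_tokens`, `keep_ratio`, the candidate-half heuristic) are illustrative assumptions.

```python
import torch

def prune_visual_tokens(vis, txt, keep_ratio=0.111):
    """Select a subset of visual tokens to keep (hypothetical sketch).

    vis: (Nv, d) visual token embeddings
    txt: (Nt, d) textual token embeddings
    keep_ratio: fraction of visual tokens retained (0.111 ~= 88.9% pruned)
    Returns: indices into `vis` of the retained tokens.
    """
    Nv = vis.size(0)
    k = max(1, int(round(Nv * keep_ratio)))

    # Stage 1 (proxy for the MI-based criterion): score each visual token by
    # cosine similarity to the mean text embedding and keep the better-aligned
    # half as candidates.
    v = torch.nn.functional.normalize(vis, dim=-1)
    t = torch.nn.functional.normalize(txt.mean(dim=0, keepdim=True), dim=-1)
    align = v @ t.squeeze(0)                       # (Nv,) alignment scores
    m = max(k, Nv // 2)
    cand = align.topk(m).indices                   # sorted by alignment, best first

    # Stage 2: greedy max-sum dispersion over the candidates. Start from the
    # best-aligned token, then repeatedly add the candidate with the largest
    # total distance to the tokens selected so far.
    dist = torch.cdist(v[cand], v[cand])           # (m, m) pairwise distances
    selected = [0]                                 # cand[0] is the best-aligned token
    total = dist[0].clone()                        # distance of each candidate to the selected set
    total[0] = float("-inf")                       # never re-pick a selected token
    for _ in range(k - 1):
        nxt = int(total.argmax())
        selected.append(nxt)
        total += dist[nxt]                         # -inf entries stay -inf
        total[nxt] = float("-inf")
    return cand[selected]
```

The cosine proxy and the "keep the better-aligned half" heuristic only illustrate the structure of alignment-then-diversity pruning; an actual system would use the model's own alignment signal (e.g., an explicit mutual-information estimate) and choose the keep ratio to match its latency budget.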