Large Vision-Language Models (LVLMs) incur high computational costs because their visual tokens are highly redundant. To reduce this cost, researchers have proposed a variety of visual token pruning methods. However, existing methods are generally limited: pruning inside the vision encoder discards critical visual information prematurely, while pruning inside the Large Language Model (LLM) leaves redundant information among the selected tokens. To address these challenges, we propose a Visual and Textual Semantic Collaborative Pruning framework (ViTCoP) that combines redundancy filtering in the vision encoder with step-wise co-pruning in the LLM, guided by the LLM's hierarchical characteristics, so as to efficiently retain visual tokens that are both critical and informationally diverse. In addition, to remain compatible with acceleration techniques such as FlashAttention, we adopt the L2 norm of the key (K) vectors as the token saliency metric inside the LLM. Extensive experiments on a range of LVLMs show that ViTCoP not only achieves state-of-the-art performance, surpassing existing methods on both image and video understanding tasks, but also substantially reduces inference latency and GPU memory consumption. Notably, its advantage over competing methods becomes even more pronounced at extreme pruning rates.
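To make the saliency metric concrete: because FlashAttention never materializes the attention map, attention-score-based token importance is unavailable, whereas key-vector norms are always accessible. The sketch below is a minimal PyTorch approximation of such a key-norm saliency score and top-k selection; it is not the paper's implementation, and the function names, the averaging over heads, and the "larger key norm = more salient" convention are all our assumptions for illustration.

```python
import torch

def keynorm_saliency(keys: torch.Tensor) -> torch.Tensor:
    """keys: (batch, num_heads, num_visual_tokens, head_dim) K-vectors of the
    visual tokens at some LLM layer. Returns saliency of shape (batch, num_visual_tokens)."""
    # L2 norm of each token's key vector, averaged over attention heads
    # (head averaging is an assumption here, not specified by the abstract).
    return keys.norm(p=2, dim=-1).mean(dim=1)

def prune_visual_tokens(hidden: torch.Tensor, keys: torch.Tensor, keep_ratio: float) -> torch.Tensor:
    """hidden: (batch, num_visual_tokens, hidden_dim) visual-token hidden states.
    Keeps the top `keep_ratio` fraction of tokens ranked by key-norm saliency."""
    saliency = keynorm_saliency(keys)                            # (B, N)
    k = max(1, int(keep_ratio * hidden.size(1)))
    idx = saliency.topk(k, dim=-1).indices.sort(dim=-1).values   # preserve token order
    return hidden.gather(1, idx.unsqueeze(-1).expand(-1, -1, hidden.size(-1)))

# Usage with hypothetical LLaVA-like shapes: 576 visual tokens pruned to 144.
hidden = torch.randn(1, 576, 4096)
keys = torch.randn(1, 32, 576, 128)
kept = prune_visual_tokens(hidden, keys, keep_ratio=0.25)        # (1, 144, 4096)
```

Since this score depends only on the keys, it can be computed inside any fused-attention forward pass without recovering per-token attention weights.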