In this paper, we introduce PruneVid, a visual token pruning method designed to enhance the efficiency of multi-modal video understanding. Large Language Models (LLMs) have shown promising performance in video tasks due to their extended capabilities in comprehending visual modalities. However, the substantial redundancy in video data presents significant computational challenges for LLMs. To address this issue, we introduce a training-free method that 1) minimizes video redundancy by merging spatial-temporal tokens, and 2) leverages LLMs' reasoning capabilities to selectively retain only the visual features relevant to question tokens, enhancing model efficiency. We validate our method across multiple video benchmarks, demonstrating that PruneVid can prune over 80% of tokens while maintaining competitive performance when combined with different model networks. This highlights its superior effectiveness and efficiency compared to existing pruning methods. Code: https://github.com/Visual-AI/PruneVid.
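The two steps described above can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names (`merge_static_tokens`, `prune_by_question_attention`), the per-position static-run merging rule, the similarity threshold, and the keep ratio are all hypothetical simplifications chosen for clarity. The sketch only conveys the general idea of merging temporally redundant tokens and then keeping the tokens most attended to by the question.

```python
import numpy as np

def cosine(a, b):
    # Cosine similarity between two feature vectors.
    return (a * b).sum(-1) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)

def merge_static_tokens(feats, sim_thresh=0.9):
    """feats: (T, N, D) frame-wise visual tokens (hypothetical layout).
    For each spatial position, consecutive frames whose token barely
    changes (cosine similarity > sim_thresh) form a 'static run' and
    are merged into their mean, reducing temporal redundancy.
    Returns an (M, D) matrix of merged tokens, M <= T * N."""
    T, N, D = feats.shape
    merged = []
    for n in range(N):
        run = [feats[0, n]]
        for t in range(1, T):
            if cosine(feats[t, n], run[-1]) > sim_thresh:
                run.append(feats[t, n])              # token still static: extend run
            else:
                merged.append(np.mean(run, axis=0))  # close run, emit merged token
                run = [feats[t, n]]
        merged.append(np.mean(run, axis=0))
    return np.stack(merged)

def prune_by_question_attention(tokens, question, keep_ratio=0.2):
    """Score each visual token by its total softmax attention received
    from question tokens, and keep only the top keep_ratio fraction."""
    logits = question @ tokens.T / np.sqrt(tokens.shape[-1])  # (Q, M)
    attn = np.exp(logits - logits.max(-1, keepdims=True))
    attn /= attn.sum(-1, keepdims=True)
    scores = attn.sum(0)                                      # (M,) relevance per token
    k = max(1, int(keep_ratio * len(tokens)))
    keep_idx = np.argsort(scores)[-k:]
    return tokens[keep_idx]
```

In a real pipeline the attention scores would come from the LLM's own attention maps rather than a fresh dot-product, but the selection principle (rank visual tokens by question relevance, drop the rest) is the same.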