Large Vision-Language Models (LVLMs) have shown impressive performance across multi-modal tasks by encoding images into thousands of tokens. However, the large number of image tokens incurs significant computational overhead, and dynamic high-resolution inputs further increase this burden. Previous approaches attempt to reduce the number of image tokens through token pruning, typically selecting tokens by attention score or image token diversity. Through empirical study, we observe that existing methods often overlook the joint impact of pruning on the current layer's output (local) and on the outputs of subsequent layers (global), leading to suboptimal pruning decisions. To address this challenge, we propose Balanced Token Pruning (BTP), a plug-and-play method for pruning vision tokens. Specifically, BTP uses a small calibration set to divide the pruning process into multiple stages: early stages emphasize the impact of pruning on subsequent layers, while deeper stages shift the focus toward preserving the consistency of local outputs. Extensive experiments across various LVLMs and multiple benchmarks demonstrate the broad effectiveness of our approach. On average, BTP achieves a 78% compression rate while preserving 96.7% of the original models' performance. Our code is available at https://github.com/EmbodiedCity/NeurIPS2025-Balanced-Token-Pruning.
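To make the staged, balanced pruning criterion concrete, below is a minimal PyTorch sketch of one way such a scheme could look. It is an illustration under stated assumptions, not the authors' released implementation (see the repository above for that): the function names `balanced_scores` and `prune_image_tokens`, the linear blending of signals, the alpha schedule, and the use of attention scores as the local signal and a calibration-set estimate as the global signal are all hypothetical.

```python
import torch

def balanced_scores(local_score: torch.Tensor,
                    global_score: torch.Tensor,
                    alpha: float) -> torch.Tensor:
    # Blend the two per-token importance signals. alpha weights the global
    # term: high in early stages, low in deep stages (per the abstract).
    return alpha * global_score + (1.0 - alpha) * local_score

def prune_image_tokens(hidden: torch.Tensor,
                       local_score: torch.Tensor,
                       global_score: torch.Tensor,
                       keep_ratio: float,
                       alpha: float):
    # Keep the top-scoring image tokens at one pruning stage.
    n_keep = max(1, int(keep_ratio * hidden.size(0)))
    scores = balanced_scores(local_score, global_score, alpha)
    keep = scores.topk(n_keep).indices.sort().values  # preserve token order
    return hidden[keep], keep

# Toy usage: 576 image tokens (a common LLaVA-style count), two stages.
torch.manual_seed(0)
tokens = torch.randn(576, 1024)   # [num_image_tokens, hidden_dim]
local = torch.rand(576)           # e.g. attention received from text tokens
glob = torch.rand(576)            # hypothetical calibration-set estimate

# Early stage: emphasize impact on subsequent layers (high alpha).
tokens, kept = prune_image_tokens(tokens, local, glob, keep_ratio=0.5, alpha=0.8)
# Deep stage: emphasize local-output consistency (low alpha).
tokens, kept2 = prune_image_tokens(tokens, local[kept], glob[kept],
                                   keep_ratio=0.44, alpha=0.2)
print(tokens.shape)  # ~22% of tokens remain, i.e. roughly 78% compression
```

The stage-dependent alpha captures the idea the abstract describes: the same score-blending step serves both regimes, with only the weight shifting from the global term toward the local term as pruning moves into deeper layers.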