Visual token pruning has recently been studied as a way to handle the large number of visual tokens in Multimodal Large Language Models. However, we observe that while existing pruning methods perform reliably on simple visual understanding tasks, they generalize poorly to complex visual reasoning tasks, a critical gap that previous studies have left underexplored. Through a systematic analysis, we identify Relevant Visual Information Shift (RVIS) during decoding as the primary driver of this failure. To address this, we propose Decoding-stage Shift-aware Token Pruning (DSTP), a training-free add-on framework that enables existing pruning methods to align visual tokens with shifting reasoning requirements during the decoding stage. Extensive experiments demonstrate that DSTP significantly mitigates the performance degradation of pruning methods on complex reasoning tasks, while also yielding consistent gains on visual understanding benchmarks. Furthermore, DSTP is effective across diverse state-of-the-art architectures and adds only minimal computational overhead, highlighting its generalizability and efficiency.