Visual token pruning is a promising approach for reducing the computational cost of vision-language models (VLMs), and existing methods often rely on early pruning decisions to improve efficiency. While effective on coarse-grained reasoning tasks, these methods suffer significant performance degradation on tasks that require fine-grained visual details. Through layer-wise analysis, we reveal substantial discrepancies in visual token importance across layers: tokens deemed unimportant at shallow layers can later become highly relevant for text-conditioned reasoning. To avoid the irreversible loss of critical information caused by premature pruning, we introduce a new pruning paradigm, termed bypass, which preserves unselected visual tokens and forwards them to subsequent pruning stages for re-evaluation. Building on this paradigm, we propose SwiftVLM, a simple and training-free method that performs pruning at model-specific layers with strong visual token selection capability, while enabling independent pruning decisions across layers. Experiments across multiple VLMs and benchmarks demonstrate that SwiftVLM consistently outperforms existing pruning strategies, achieving superior accuracy-efficiency trade-offs and more faithful visual token selection behavior.
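The bypass paradigm can be illustrated with a minimal PyTorch sketch. Everything below is hypothetical scaffolding rather than SwiftVLM's actual implementation: `score_tokens` uses cosine similarity against a pooled text query as a stand-in for the model's text-conditioned importance scores, and `bypass_stage` shows the core idea, namely that unselected tokens are retained and re-scored at the next pruning stage instead of being discarded.

```python
import torch
import torch.nn.functional as F

def score_tokens(visual_tokens: torch.Tensor, text_query: torch.Tensor) -> torch.Tensor:
    """Text-conditioned relevance score for each visual token.

    Cosine similarity is a stand-in for the cross-attention-based
    importance scores a real VLM would expose.
    Shapes: visual_tokens (N, d), text_query (d,); returns (N,).
    """
    return F.cosine_similarity(visual_tokens, text_query.unsqueeze(0), dim=-1)

def bypass_stage(active: torch.Tensor,
                 bypassed: torch.Tensor,
                 text_query: torch.Tensor,
                 keep_ratio: float):
    """One pruning stage under the bypass paradigm.

    Pools the currently active tokens with previously bypassed ones,
    re-scores them all, and splits the pool into a new active set and a
    new bypass set. No token is ever discarded, so a token judged
    unimportant at an earlier stage can still be recovered here.
    """
    k = max(1, int(keep_ratio * active.size(0)))   # shrink only the active set
    pool = torch.cat([active, bypassed], dim=0)    # every token remains a candidate
    scores = score_tokens(pool, text_query)
    keep = torch.zeros(pool.size(0), dtype=torch.bool)
    keep[scores.topk(k).indices] = True
    return pool[keep], pool[~keep]

# Toy usage: three pruning stages over 576 visual tokens of width 64.
torch.manual_seed(0)
active = torch.randn(576, 64)
bypassed = active.new_zeros((0, 64))
text_query = torch.randn(64)
for keep_ratio in (0.5, 0.5, 0.5):
    # In a real VLM, `active` would pass through transformer blocks here,
    # while `bypassed` skips that compute entirely.
    active, bypassed = bypass_stage(active, bypassed, text_query, keep_ratio)
print(active.shape, bypassed.shape)  # torch.Size([72, 64]) torch.Size([504, 64])
```

After the loop, `active` holds the tokens that would feed the later transformer blocks, while the `bypassed` tokens have skipped that compute yet remain available for re-evaluation; this is what allows a token deemed unimportant at a shallow layer to be recovered once text-conditioned reasoning makes it relevant, in contrast to hard pruning, which drops it irreversibly.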