Large Vision-Language Models (LVLMs) rely on dense visual tokens to capture fine-grained visual information, but processing all of these tokens incurs substantial computational and memory overhead during inference. To address this issue, we propose ResPrune, a training-free visual token pruning framework that enables efficient LVLM inference by selecting a compact yet informative subset of visual tokens. ResPrune formulates visual token pruning as a subspace reconstruction problem and employs a greedy subspace expansion strategy guided by residual energy, which allows it to preserve the geometric structure of the original visual token space. To further incorporate cross-modal alignment, the selection process is conditioned on textual relevance, encouraging the retention of tokens that are both informative and instruction-relevant. The proposed method is lightweight and model-agnostic, and it can be integrated into existing LVLM pipelines without retraining or architectural modifications. Extensive experiments on multiple LVLM backbones, including LLaVA-1.5, LLaVA-NeXT, and Qwen2.5-VL, demonstrate that ResPrune consistently outperforms existing pruning approaches across a wide range of benchmarks while reducing computation, memory consumption, and inference latency.
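To make the idea of greedy subspace expansion guided by residual energy concrete, the following is a minimal NumPy sketch, not the ResPrune implementation. The abstract does not specify the exact scoring rule, so several choices here are assumptions: the function name `greedy_residual_prune`, the use of squared residual norm as "residual energy", and the way textual relevance enters (a per-token non-negative weight `relevance` that multiplies the residual energy). At each step the sketch picks the token whose weighted residual energy is largest, adds its residual direction to the selected subspace, and removes that direction from all remaining residuals.

```python
import numpy as np


def greedy_residual_prune(tokens, k, relevance=None, eps=1e-12):
    """Return indices of k tokens chosen by greedy subspace expansion.

    tokens:    (n, d) array of visual token features.
    k:         number of tokens to keep.
    relevance: optional (n,) non-negative weights. This stands in for a
               text-relevance score and is an assumption of this sketch.
    """
    X = np.asarray(tokens, dtype=np.float64)
    n = X.shape[0]
    if relevance is None:
        w = np.ones(n)
    else:
        w = np.asarray(relevance, dtype=np.float64)

    residual = X.copy()                  # part of each token not yet spanned
    available = np.ones(n, dtype=bool)   # tokens that can still be selected
    selected = []

    for _ in range(min(k, n)):
        # Residual energy: squared norm of each token's unexplained component.
        energy = np.einsum("ij,ij->i", residual, residual)
        # Assumed scoring rule: residual energy scaled by relevance weight.
        score = np.where(available, energy * w, -np.inf)
        i = int(np.argmax(score))
        selected.append(i)
        available[i] = False

        norm = np.sqrt(energy[i])
        if norm > eps:
            # Expand the subspace with the new orthonormal direction q and
            # project it out of every residual (one Gram-Schmidt step).
            q = residual[i] / norm
            residual -= np.outer(residual @ q, q)

    return selected
```

With uniform weights this reduces to a pivoting rule of the kind used in column-pivoted QR, applied to token rows: tokens that are nearly linear combinations of already selected tokens have little residual energy and are skipped. Passing a non-uniform `relevance` vector shifts the selection toward tokens with higher weights while still penalizing redundancy.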