Large Vision-Language Models (LVLMs) have achieved remarkable success across diverse multimodal tasks, yet their practical deployment remains constrained by the computational cost of long visual token sequences. Visual token pruning has emerged as a promising solution, but existing methods share a fundamental limitation: once tokens are pruned at a given layer, they are inaccessible to all subsequent layers, causing premature information loss that can degrade model performance. Through empirical studies, we observe that different layers focus on distinct visual regions, indicating that the optimal token subset varies across layers. Motivated by this insight, we propose Adaptive Layer-wise Visual Token Selection (ALVTS), a novel framework that departs from the conventional static token pruning paradigm. At each layer, ALVTS uses a lightweight token selector to identify important tokens and route them through the layer, while less important tokens skip the layer, thus minimizing redundant computation. The two token streams are reintegrated before being fed into subsequent layers, enabling adaptive compression across the entire model. Grounded in our importance-consistency-constrained low-rank approximation, the proposed token selection module closely emulates the full attention mechanism and captures its essential patterns without requiring model retraining. Extensive experiments on LLaVA-1.5, LLaVA-NeXT, and Qwen2.5-VL validate the effectiveness of our method. With an 89% token compression ratio, ALVTS retains 96.7% of the original model's accuracy, achieving a superior efficiency-accuracy trade-off for LVLM inference.
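The select, skip, and reintegrate routing described above can be illustrated with a minimal sketch. It is not the ALVTS implementation: the function name `route_through_layer`, the `keep_ratio` parameter, and the toy `double` layer are hypothetical, and the importance scores are passed in as plain numbers rather than produced by the low-rank selector the abstract mentions. The sketch only shows that a token skipped at one layer stays available for selection at the next.

```python
from typing import Callable, List, Sequence, Tuple

Token = List[float]
Layer = Callable[[List[Token]], List[Token]]


def route_through_layer(
    tokens: Sequence[Token],
    scores: Sequence[float],
    keep_ratio: float,
    layer: Layer,
) -> Tuple[List[Token], List[int]]:
    """Apply `layer` to the highest-scoring tokens only; the rest skip it.

    Hypothetical helper for illustration. `scores` stands in for the output
    of a token selector and is assumed to be given.
    """
    if len(tokens) != len(scores):
        raise ValueError("tokens and scores must have the same length")
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    if not tokens:
        return [], []

    n_keep = max(1, int(keep_ratio * len(tokens)))

    # Rank positions by score, keep the top n_keep, then restore
    # the original sequence order for the selected positions.
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    selected = sorted(ranked[:n_keep])

    # Only the selected tokens pay the cost of the layer.
    processed = layer([list(tokens[i]) for i in selected])

    # Reintegrate: skipped tokens pass through unchanged, processed
    # tokens are written back to their original positions.
    merged = [list(t) for t in tokens]
    for i, new_token in zip(selected, processed):
        merged[i] = new_token
    return merged, selected


def double(batch: List[Token]) -> List[Token]:
    """Toy stand-in for a transformer layer: doubles every value."""
    return [[2.0 * x for x in t] for t in batch]


tokens = [[1.0], [2.0], [3.0], [4.0]]

# Layer 1 scores favor positions 1 and 3.
after_l1, picked_l1 = route_through_layer(tokens, [0.1, 0.9, 0.2, 0.8], 0.5, double)

# Layer 2 scores favor positions 0 and 2, which skipped layer 1
# but were never discarded.
after_l2, picked_l2 = route_through_layer(after_l1, [0.9, 0.1, 0.8, 0.2], 0.5, double)

print(picked_l1, after_l1)
print(picked_l2, after_l2)
```

In a static pruning scheme, positions 0 and 2 would have been dropped after the first layer. Here they remain in the sequence and are selected by the second layer, which is the behavior the abstract contrasts with conventional pruning.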