Vision-language models (VLMs) have achieved impressive performance on multimodal reasoning tasks such as visual question answering and image captioning, but their inference cost remains a significant challenge due to the large number of vision tokens processed during the prefill stage. Existing pruning methods typically rely on raw cross-modal attention patterns or static text-prompt guidance, and thus fail to exploit the dynamic internal signals generated during inference. To address these issues, we propose AdaptInfer, a plug-and-play framework for adaptive vision-token pruning in VLMs. First, we introduce a fine-grained, dynamic text-guided pruning mechanism that reuses layer-wise text-to-text attention maps to construct soft priors over text-token importance, enabling more informed scoring of vision tokens at each stage. Second, we perform an offline analysis of cross-modal attention shifts and identify consistent inflection points during inference, which motivates a more principled and efficient pruning schedule. Our method is lightweight, plug-and-play, and generalizes across multimodal tasks. Experiments verify its effectiveness: for example, it reduces CUDA latency by 61.3% while retaining 93.1% of the average accuracy of vanilla LLaVA-1.5-7B. Under the same token budget, AdaptInfer also surpasses state-of-the-art methods in accuracy.
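As a rough illustration of the text-guided scoring idea described above, the sketch below ranks vision tokens by text-to-vision attention weighted with a soft prior derived from text-to-text attention. It is a minimal sketch under assumptions not specified in the abstract: the function name `score_vision_tokens`, the tensor shapes, the column-mean prior, and the `keep_ratio` hyperparameter are all hypothetical choices, not the paper's exact formulation.

```python
import torch

def score_vision_tokens(attn_text_to_text: torch.Tensor,
                        attn_text_to_vision: torch.Tensor,
                        keep_ratio: float = 0.5) -> torch.Tensor:
    """Hypothetical sketch of text-guided vision-token scoring.

    attn_text_to_text:   [num_text, num_text]  attention among text tokens at a layer
    attn_text_to_vision: [num_text, num_vision] attention from text tokens to vision tokens
    Returns the indices of vision tokens to keep, in their original order.
    """
    # Soft prior over text tokens: how much attention each text token receives
    # from the other text tokens (column-wise mean), normalized to sum to 1.
    text_prior = attn_text_to_text.mean(dim=0)
    text_prior = text_prior / text_prior.sum()

    # Score each vision token by text-to-vision attention weighted by the prior,
    # so attention coming from salient text tokens counts more.
    vision_scores = text_prior @ attn_text_to_vision  # [num_vision]

    # Keep only the top-scoring fraction of vision tokens.
    num_keep = max(1, int(keep_ratio * vision_scores.numel()))
    keep_idx = vision_scores.topk(num_keep).indices
    return torch.sort(keep_idx).values
```

In this sketch, the prior is recomputed from the current layer's text-to-text attention, so the scoring adapts to the signals produced during inference rather than relying on a fixed prompt-level weighting.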