Vision-language models (VLMs) have shown remarkable success across various multi-modal tasks, yet large VLMs encounter significant efficiency challenges due to processing numerous visual tokens. A promising approach to accelerating large VLM inference is using partial information, such as attention maps from specific layers, to assess token importance and prune less essential tokens. However, our study reveals three key insights: (i) Partial attention information is insufficient for accurately identifying critical visual tokens, resulting in suboptimal performance, especially at low token retention ratios; (ii) Global attention information, such as the attention map aggregated across all layers, more effectively preserves essential tokens and maintains comparable performance under aggressive pruning. However, obtaining the attention maps from all layers requires a full inference pass, which increases computational load and is therefore impractical for existing methods; and (iii) The global attention map aggregated from a small VLM closely resembles that of a large VLM, suggesting an efficient alternative. Based on these findings, we introduce a \textbf{training-free} method, \underline{\textbf{S}}mall VLM \underline{\textbf{G}}uidance for accelerating \underline{\textbf{L}}arge VLMs (\textbf{SGL}). Specifically, we employ the attention map aggregated from a small VLM to guide visual token pruning in a large VLM. Additionally, we develop an early-exit mechanism that fully exploits the small VLM's predictions, dynamically invoking the large VLM only when necessary, yielding a superior trade-off between accuracy and computation. Extensive evaluations across 11 benchmarks demonstrate the effectiveness and generalizability of SGL, achieving a pruning ratio of up to 91\% for visual tokens while retaining competitive performance.
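The core pipeline described above can be sketched in a few lines: aggregate the small VLM's text-to-visual attention across all layers and heads into per-token importance scores, keep only the top-scoring visual tokens for the large VLM, and skip the large VLM entirely when the small VLM is already confident. This is a minimal NumPy sketch of the general idea, not the paper's implementation; the tensor layout, averaging scheme, and confidence threshold are illustrative assumptions.

```python
import numpy as np

def aggregate_attention(attn_maps: np.ndarray) -> np.ndarray:
    """Global importance scores for visual tokens.

    attn_maps: small-VLM attention of shape
        (num_layers, num_heads, num_text_tokens, num_visual_tokens).
    Returns one score per visual token, averaged over all layers,
    heads, and text queries (an assumed aggregation scheme).
    """
    return attn_maps.mean(axis=(0, 1, 2))

def prune_tokens(visual_tokens: np.ndarray, scores: np.ndarray,
                 keep_ratio: float = 0.09) -> np.ndarray:
    """Keep the highest-scoring visual tokens for the large VLM.

    keep_ratio=0.09 corresponds to a 91% pruning ratio.
    """
    k = max(1, int(len(scores) * keep_ratio))
    keep = np.argsort(scores)[-k:]
    keep.sort()  # preserve the original token order
    return visual_tokens[keep]

def should_invoke_large_vlm(small_vlm_confidence: float,
                            threshold: float = 0.8) -> bool:
    """Early exit: accept the small VLM's prediction when it is
    confident enough; otherwise fall back to the (pruned) large VLM.
    The threshold value here is a placeholder."""
    return small_vlm_confidence < threshold
```

A caller would first run the small VLM once, reuse its attention maps to score the image tokens, and only run the large VLM on the pruned token set when `should_invoke_large_vlm` returns `True`.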