Large vision-language models (VLMs) have demonstrated remarkable capabilities in open-world multimodal understanding, yet their high computational overhead poses great challenges for practical deployment. Recent works have proposed accelerating VLMs by pruning redundant visual tokens, guided by the attention maps of the VLM's early layers. Despite the success of these token pruning methods, they still suffer from two major shortcomings: (i) a considerable accuracy drop due to insensitive attention signals in early layers, and (ii) limited speedup when generating long responses (e.g., 30 tokens). To address these limitations, we present TwigVLM, a simple and general architecture that grows a lightweight "twig" upon an early layer of the base VLM. Compared with most existing VLM acceleration methods based purely on visual token pruning, TwigVLM not only achieves better accuracy retention by employing a twig-guided token pruning (TTP) strategy, but also yields higher generation speed by utilizing a self-speculative decoding (SSD) strategy. Taking LLaVA-1.5-7B as the base VLM, experimental results show that TwigVLM preserves 96% of the original performance after pruning 88.9% of the visual tokens and achieves a 154% speedup when generating long responses, significantly outperforming state-of-the-art VLM acceleration methods in both accuracy and speed. Code will be made publicly available.
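To make the attention-guided pruning idea concrete, the following is a minimal sketch (not the paper's TTP implementation) of generic top-k visual token pruning: tokens are ranked by the attention they receive from a text query position, and only the highest-scoring fraction is kept. The function name, the use of NumPy, and the `keep_ratio` value are illustrative assumptions; the 576-token count matches LLaVA-1.5's visual sequence length, and keeping ~11.1% corresponds to the 88.9% pruning rate reported above.

```python
import numpy as np

def prune_visual_tokens(attn, visual_tokens, keep_ratio=0.111):
    """Keep the top-k visual tokens ranked by received attention.

    attn:          (num_visual,) attention scores from a text query token.
    visual_tokens: (num_visual, dim) visual token embeddings.
    keep_ratio:    fraction of tokens to keep (illustrative; ~11.1%
                   corresponds to pruning 88.9% of visual tokens).
    """
    k = max(1, int(round(len(visual_tokens) * keep_ratio)))
    keep = np.argsort(attn)[-k:]   # indices of the k highest scores
    keep = np.sort(keep)           # restore original token order
    return visual_tokens[keep], keep

# Toy example: LLaVA-1.5 encodes an image into 576 visual tokens.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(576, 8))
attn = rng.random(576)
pruned, idx = prune_visual_tokens(attn, tokens)
print(pruned.shape)  # (64, 8): 576 * 0.111 rounds to 64 surviving tokens
```

In TwigVLM the ranking signal comes from the lightweight twig rather than the base model's early-layer attention, which is what recovers the accuracy lost by early-layer pruning.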