Vision-Language Models (VLMs) have demonstrated impressive performance across a diverse set of tasks. A key challenge in accelerating VLMs is storing and accessing the large Key-Value (KV) cache that encodes long visual contexts, such as images or videos. While existing KV cache compression methods are effective for Large Language Models (LLMs), directly migrating them to VLMs yields suboptimal accuracy and speedup. To bridge the gap, we propose VL-Cache, a novel KV cache compression recipe tailored for accelerating VLM inference. In this paper, we first investigate the unique sparsity pattern of VLM attention by distinguishing visual and text tokens in the prefill and decoding phases. Based on these observations, we introduce a layer-adaptive, sparsity-aware cache budget allocation method that effectively distributes the limited cache budget across different layers, further reducing KV cache size without compromising accuracy. Additionally, we develop a modality-aware token scoring policy to better evaluate token importance. Empirical results on multiple benchmark datasets demonstrate that retaining only 10% of the KV cache achieves accuracy comparable to that of the full cache. In a speed benchmark, our method reduces the end-to-end latency of generating 100 tokens by up to 2.33x, speeds up decoding by up to 7.08x, and cuts the GPU memory footprint of the KV cache by 90%.
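The layer-adaptive, sparsity-aware budget allocation can be illustrated with a minimal sketch. Here, each layer's share of a global token budget is made proportional to its measured attention density (1 − sparsity), so denser layers retain more KV entries. The proportional rule and the function name `allocate_budgets` are illustrative assumptions for exposition, not the paper's exact algorithm.

```python
import numpy as np

def allocate_budgets(layer_sparsity, total_budget):
    """Distribute a global KV-cache token budget across layers in
    proportion to each layer's attention density (1 - sparsity):
    denser layers keep more tokens. A simplified sketch, not the
    paper's exact allocation rule."""
    density = 1.0 - np.asarray(layer_sparsity, dtype=float)
    weights = density / density.sum()  # normalize to a distribution
    return np.round(weights * total_budget).astype(int)

# Hypothetical example: 4 layers, layer 0 is the densest,
# global budget = 10% of a 4 x 4096-token full cache = 1638 tokens.
budgets = allocate_budgets([0.5, 0.8, 0.9, 0.9], total_budget=1638)
```

Under this toy allocation, the densest layer (sparsity 0.5) receives the largest per-layer budget, while highly sparse layers keep only a small slice, matching the intuition that uniform per-layer budgets waste capacity on layers whose attention is already concentrated on few tokens.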