Large Visual Language Models (LVLMs) have recently gained significant attention due to their remarkable reasoning capabilities and strong generalization. However, processing a large number of visual tokens and generating long-context outputs impose substantial computational overhead, leading to excessive demands on the key-value (KV) cache. To address this critical bottleneck, we propose AirCache, a novel KV cache compression method aimed at accelerating LVLM inference. This work systematically investigates the correlations between visual and textual tokens within the attention mechanisms of LVLMs. Our empirical analysis reveals considerable redundancy in cached visual tokens: strategically eliminating these tokens preserves model performance while significantly accelerating context generation. Inspired by these findings, we introduce an elite observation window for assessing the importance of visual components in the KV cache, focusing on stable inter-modal relevancy modeling with enhanced multi-perspective consistency. Additionally, we develop an adaptive layer-wise budget allocation strategy that capitalizes on the strength and skewness of the token importance distribution, showing superior efficiency compared to uniform allocation. Comprehensive evaluations across multiple LVLMs and benchmarks demonstrate that our method achieves performance comparable to the full cache while retaining only 10% of the visual KV cache, thereby reducing decoding latency by 29% to 66% across various batch sizes and input prompt lengths. Notably, as cache retention rates decrease, our method exhibits increasing performance advantages over existing approaches.
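To make the two components concrete, the following is a minimal Python sketch of how an observation-window importance score and a skewness-aware layer budget might be computed. The function names, the window size, the head-averaging step, and the specific strength/skewness weighting are illustrative assumptions for exposition; the abstract does not specify AirCache's exact formulas.

```python
import numpy as np
from scipy.stats import skew

def score_visual_tokens(attn, window_size=8):
    """Score cached visual tokens by the attention they receive from an
    observation window of recent textual query tokens.

    attn: array of shape [num_heads, num_queries, num_visual_tokens].
    Returns a per-token importance vector of length num_visual_tokens.
    """
    window = attn[:, -window_size:, :]      # queries inside the observation window (assumed size)
    per_head = window.mean(axis=1)          # average attention over window queries
    return per_head.mean(axis=0)            # aggregate across heads for consistency

def allocate_layer_budgets(layer_scores, total_budget):
    """Split a global retention budget across layers.

    Layers with strong, flat importance distributions receive a larger share;
    layers whose importance concentrates on a few tokens (high skew) keep fewer
    entries. This weighting is an assumption, not the paper's exact rule.
    """
    weights = []
    for scores in layer_scores:
        strength = scores.sum()
        flatness = 1.0 / (1.0 + max(skew(scores), 0.0))
        weights.append(strength * flatness)
    weights = np.asarray(weights)
    weights = weights / weights.sum()
    return np.maximum(1, np.round(weights * total_budget).astype(int))

def compress_kv(keys, values, scores, budget):
    """Keep only the top-`budget` visual entries of one layer's KV cache."""
    keep = np.argsort(scores)[::-1][:budget]
    return keys[keep], values[keep]
```

Under this kind of scheme, a layer whose importance mass is concentrated on a handful of visual tokens retains only those few entries, while layers with flatter, stronger importance distributions receive a larger share of the global budget than a uniform split would give them.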