The key-value (KV) cache has become a de facto standard component of inference in modern Large Vision-Language Models (LVLMs). While it improves decoding efficiency in Large Language Models (LLMs), adopting it directly in LVLMs introduces substantial GPU memory overhead because of the large number of vision tokens processed during the prefill stage. To address this problem, we propose LightKV, a novel approach that reduces KV cache size by exploiting the redundancy among vision-token embeddings. Guided by the text prompt, LightKV uses cross-modality message passing to aggregate information across vision tokens and progressively compresses them during prefill. This prompt-aware guidance distinguishes our method from prior vision-only compression strategies. We evaluate LightKV on eight open-source LVLMs across eight public benchmark datasets, including MME and SeedBench. Experimental results show that, while retaining only 55% of the original vision tokens, LightKV (a) halves the vision-token KV cache size, (b) reduces computation by up to 40%, and (c) preserves general-purpose performance while significantly outperforming existing baselines.
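To make the memory argument concrete, the following is a rough back-of-the-envelope sketch of how vision-token count drives KV cache size. It is not LightKV's implementation. The model dimensions (32 layers, 32 KV heads, head dimension 128, fp16 values) and the 576 vision tokens per image are illustrative assumptions, not figures from this work, and the helper `kv_cache_bytes` is hypothetical. The sketch also assumes, for simplicity, that every layer caches the same number of retained tokens, whereas progressive compression may leave different counts at different layers.

```python
def kv_cache_bytes(num_tokens, num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Bytes needed to cache keys and values for num_tokens across all layers.

    Hypothetical helper for illustration only; assumes every layer caches
    the same number of tokens.
    """
    # Factor 2: each layer stores one key tensor and one value tensor.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value * num_tokens


# Illustrative configuration (assumptions, not taken from the paper).
NUM_LAYERS = 32
NUM_KV_HEADS = 32
HEAD_DIM = 128
VISION_TOKENS = 576   # assumed vision tokens per image
KEEP_RATIO = 0.55     # fraction of vision tokens retained, as stated in the abstract

full = kv_cache_bytes(VISION_TOKENS, NUM_LAYERS, NUM_KV_HEADS, HEAD_DIM)
kept_tokens = round(VISION_TOKENS * KEEP_RATIO)
compressed = kv_cache_bytes(kept_tokens, NUM_LAYERS, NUM_KV_HEADS, HEAD_DIM)

MIB = 1024 ** 2
print(f"full vision-token KV cache:       {full / MIB:.1f} MiB")
print(f"compressed ({kept_tokens} tokens): {compressed / MIB:.1f} MiB")
print(f"ratio: {compressed / full:.2f}")
```

Under these assumptions the cache grows linearly with the number of cached vision tokens, which is why reducing the token count translates directly into memory savings.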