Mixing Importance with Diversity: Joint Optimization for KV Cache Compression in Large Vision-Language Models

Recent large vision-language models (LVLMs) can process long multi-modal sequences, but the accompanying growth of the key-value (KV) cache creates a memory bottleneck that limits deployment scalability. Existing KV cache compression methods retain high-importance KV pairs to reduce storage, yet they often overlook the modality-specific semantic redundancy that arises in multi-modal KV caches. In this work, we first show that, beyond importance, the KV cache in LVLMs exhibits varying levels of redundancy across attention heads, and that selecting by importance alone covers only a subset of the full KV cache information distribution, which can cost semantic coverage. To address this, we propose MixKV, a method that mixes importance with diversity for KV cache compression in LVLMs. MixKV adapts to head-wise semantic redundancy, balancing diversity and importance per head when selecting which KV pairs to keep. Extensive experiments show that MixKV consistently improves existing methods across multiple LVLMs. Under extreme compression (budget=64), MixKV improves baseline methods by an average of 5.1% across five multi-modal understanding benchmarks and yields gains of 8.0% and 9.0% for SnapKV and AdaKV, respectively, on GUI grounding tasks, while maintaining comparable inference efficiency. MixKV also extends to LLMs with comparable performance gains. Our code is available at https://github.com/xuyang-liu16/MixKV.
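To make the idea of mixing importance with diversity concrete, the following is a minimal sketch of per-head KV selection. It is not the MixKV implementation: the abstract does not give the scoring formulas, so the redundancy proxy (average pairwise cosine similarity of the keys), the diversity score (dissimilarity to the mean key direction), the linear mixing rule, and all function names here are illustrative assumptions. Importance scores are taken as given, for example from a baseline such as SnapKV.

```python
import math


def _normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def _minmax(scores):
    """Rescale scores to [0, 1]; constant inputs map to all zeros."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0] * len(scores)
    return [(s - lo) / (hi - lo) for s in scores]


def select_kv_for_head(keys, importance, budget):
    """Return sorted indices of the KV pairs to keep for one attention head.

    keys:       list of key vectors for this head
    importance: one score per key, supplied by a baseline method
    budget:     number of KV pairs to keep
    All scoring choices below are assumptions made for illustration.
    """
    n = len(keys)
    if budget >= n:
        return list(range(n))

    unit = [_normalize(k) for k in keys]
    dim = len(unit[0])
    mean_dir = [sum(u[d] for u in unit) / n for d in range(dim)]

    # Assumed redundancy proxy: the squared norm of the mean unit key equals
    # the average pairwise cosine similarity (self-pairs included), so it is
    # 1 when all keys point the same way and near 0 when they are spread out.
    redundancy = sum(m * m for m in mean_dir)

    # Assumed diversity score: keys far from the mean direction score higher.
    mean_unit = _normalize(mean_dir)
    sim = [sum(a * b for a, b in zip(u, mean_unit)) for u in unit]
    diversity = _minmax([-s for s in sim])

    # Assumed mixing rule: redundant heads lean on diversity, others on importance.
    w = min(max(redundancy, 0.0), 1.0)
    imp = _minmax(importance)
    mixed = [(1 - w) * i + w * d for i, d in zip(imp, diversity)]

    top = sorted(range(n), key=lambda i: mixed[i], reverse=True)[:budget]
    return sorted(top)


# A redundant head: three identical keys and one distinct key with low importance.
keys = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
importance = [0.9, 0.8, 0.7, 0.1]
kept = select_kv_for_head(keys, importance, budget=2)
print(kept)  # [0, 3]; importance alone would keep [0, 1]
```

In this toy head, selection by importance alone would keep two copies of the same key, while the mixed score keeps one of them plus the distinct key. When the keys are already spread out, the assumed weight falls toward zero and the selection reduces to the importance-only baseline.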
