Recent large vision-language models (LVLMs) demonstrate remarkable capabilities in processing extended multi-modal sequences, yet the resulting key-value (KV) cache expansion creates a critical memory bottleneck that fundamentally limits deployment scalability. While existing KV cache compression methods focus on retaining high-importance KV pairs to minimize storage, they often overlook the modality-specific semantic redundancy patterns that are distinctive to multi-modal KV caches. In this work, we first analyze how, beyond simple importance, the KV cache in LVLMs exhibits varying levels of redundancy across attention heads. We show that relying solely on importance covers only a subset of the full KV cache information distribution, leading to a potential loss of semantic coverage. To address this, we propose \texttt{MixKV}, a novel method that mixes importance with diversity for optimized KV cache compression in LVLMs. \texttt{MixKV} adapts to head-wise semantic redundancy, selectively balancing diversity and importance when compressing KV pairs. Extensive experiments demonstrate that \texttt{MixKV} consistently enhances existing methods across multiple LVLMs. Under extreme compression (budget=64), \texttt{MixKV} improves baseline methods by an average of \textbf{5.1\%} across five multi-modal understanding benchmarks and achieves gains of \textbf{8.0\%} and \textbf{9.0\%} for SnapKV and AdaKV, respectively, on GUI grounding tasks, all while maintaining comparable inference efficiency. Furthermore, \texttt{MixKV} extends seamlessly to LLMs with comparable performance gains. Our code is available at \href{https://github.com/xuyang-liu16/MixKV}{\textcolor{citeblue}{https://github.com/xuyang-liu16/MixKV}}.
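For intuition, the sketch below illustrates one way importance and diversity could be mixed when selecting KV pairs for a single attention head: a greedy selection that trades off a per-token importance score against redundancy with already-selected keys. This is a minimal illustrative sketch under our own assumptions, not the actual \texttt{MixKV} scoring rule or its head-wise adaptation; the function name, the trade-off weight \texttt{alpha}, and the use of cosine similarity over key vectors are hypothetical choices.

\begin{verbatim}
import torch

def mixed_kv_selection(keys, importance, budget, alpha=0.5):
    """Greedy importance-plus-diversity selection for one attention head.

    Illustrative sketch only (not the MixKV algorithm from the paper).
    keys:       [seq_len, head_dim] key vectors of one head
    importance: [seq_len] importance scores (e.g., accumulated attention)
    budget:     number of KV pairs to retain
    alpha:      hypothetical importance/diversity trade-off weight
    """
    seq_len = keys.size(0)
    keys_n = torch.nn.functional.normalize(keys, dim=-1)
    selected = [int(importance.argmax())]       # seed with the most important token
    mask = torch.ones(seq_len, dtype=torch.bool)
    mask[selected[0]] = False

    while len(selected) < min(budget, seq_len):
        # cosine similarity of each key to its closest already-selected key
        sim = keys_n @ keys_n[selected].T        # [seq_len, |selected|]
        redundancy = sim.max(dim=-1).values      # high value -> semantically redundant
        # mix importance with diversity (1 - redundancy)
        score = alpha * importance + (1 - alpha) * (1 - redundancy)
        score[~mask] = float("-inf")             # exclude positions already kept
        nxt = int(score.argmax())
        selected.append(nxt)
        mask[nxt] = False

    return torch.tensor(sorted(selected))        # indices of KV pairs to retain
\end{verbatim}

Setting \texttt{alpha=1.0} in this sketch recovers a purely importance-based selection; lowering it shifts the retained set toward broader semantic coverage, which is the trade-off \texttt{MixKV} balances per head.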