Multimodal Large Language Models (MLLMs) have advanced unified reasoning over text, images, and videos, but their inference is hindered by the rapid growth of key-value (KV) caches. Each visual input expands into thousands of tokens, so the cache scales linearly with context length and stays resident in GPU memory throughout decoding, which leads to prohibitive memory overhead and latency even on high-end GPUs. A common solution is to compress the cache under a fixed budget at one of several granularities: token-level methods uniformly discard less important tokens, layer-level methods vary retention across layers, and head-level methods redistribute budgets across heads. Yet these approaches stop at budget allocation and overlook the heterogeneous behaviors of attention heads, which call for distinct compression strategies. We propose HybridKV, a hybrid KV cache compression framework that integrates complementary strategies in three stages: heads are first classified as static or dynamic using text-centric attention; a top-down allocation scheme then assigns KV budgets hierarchically; finally, static heads are compressed by text-prior pruning and dynamic heads by chunk-wise retrieval. Experiments on 11 multimodal benchmarks with Qwen2.5-VL-7B show that HybridKV reduces KV cache memory by up to $7.9\times$ and achieves $1.52\times$ faster decoding, with almost no performance drop and, in some cases, higher performance than the full-cache MLLM.
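To make the linear-scaling claim concrete, the following minimal sketch estimates KV cache size as a function of context length and shows what a $7.9\times$ reduction would mean in absolute terms. The function names and the default configuration (28 layers, 4 KV heads, head dimension 128, 2 bytes per element) are illustrative assumptions for a 7B-class model with grouped-query attention; they are not taken from the abstract and should be replaced with the values of the model actually in use.

```python
def kv_cache_bytes(seq_len, num_layers=28, num_kv_heads=4,
                   head_dim=128, bytes_per_elem=2):
    """Estimate KV cache size in bytes for one sequence.

    All defaults are assumed, illustrative values (not from the abstract).
    The leading factor of 2 accounts for storing both keys and values.
    """
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem


def compressed_bytes(full_bytes, reduction=7.9):
    """Cache size after an assumed uniform reduction factor.

    7.9 is the best-case ratio reported in the abstract ("up to").
    """
    return full_bytes / reduction


if __name__ == "__main__":
    gib = 1024 ** 3
    for n_tokens in (8_000, 32_000, 128_000):
        full = kv_cache_bytes(n_tokens)
        small = compressed_bytes(full)
        print(f"{n_tokens:>7} tokens: full {full / gib:.2f} GiB, "
              f"compressed {small / gib:.2f} GiB")
```

Under these assumed settings the cache costs a fixed number of bytes per token, so doubling the context doubles the memory; this is why long video or multi-image inputs dominate GPU memory during decoding.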