Vision-Language Large Models (VLLMs) face significant efficiency challenges when processing high-resolution inputs: the quadratic complexity of attention and autoregressive generation, together with the ever-growing key-value (KV) cache, severely hinder both the prefilling and decoding stages. Recent efforts compress the KV cache by identifying and pruning the entries of less important tokens, but these methods typically rely on attention scores to estimate token importance, making them incompatible with efficient attention mechanisms such as FlashAttention and sparse attention, which never explicitly materialize the attention matrix. Moreover, existing methods overlook the fact that sparse attention, while accelerating the prefilling stage, alters the information structure of the KV cache and thereby undermines downstream KV cache compression strategies. To address these issues, we propose PureKV, a plug-and-play framework for the joint optimization of sparse attention and KV cache compression. We first introduce a KV cache compression strategy that is fully compatible with efficient attention accelerators: it uses attention scores from lower layers to estimate the importance of the KV cache in higher layers, enabling active pruning without sacrificing accuracy. In addition, we design a Spatial-Temporal Sparse Attention (ST-SpAttn) module tailored for video KV cache compression. By combining spatial and temporal attention sparsity, the module purifies spatial noise and temporal redundancy in the KV cache, improving the effectiveness of KV cache compression while simultaneously accelerating the prefilling stage of VLLMs. Extensive experiments on VLLMs (VideoLLaMA2, Qwen2.5-VL) show that PureKV achieves 5.0× KV cache compression and 3.16× prefilling acceleration with negligible quality degradation.
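To make the cross-layer pruning idea concrete, below is a minimal PyTorch sketch, not the paper's implementation: it assumes attention weights from one lower layer are available (e.g., from a single lightweight scoring pass), averages them over heads and query positions into a per-token importance score, and keeps only the top-scoring KV entries for a higher layer. The function name `prune_kv_with_lower_layer_scores` and the `keep_ratio` parameter are hypothetical illustrations.

```python
# Hypothetical sketch of lower-layer-guided KV cache pruning (not the paper's code).
import torch

def prune_kv_with_lower_layer_scores(
    keys: torch.Tensor,        # higher-layer keys,   (batch, heads, seq, dim)
    values: torch.Tensor,      # higher-layer values, (batch, heads, seq, dim)
    lower_attn: torch.Tensor,  # lower-layer attention weights, (batch, heads, q_len, seq)
    keep_ratio: float = 0.2,   # fraction of entries retained (0.2 ~ 5x compression)
):
    # Importance of each key position = attention mass it received at the
    # lower layer, averaged over heads and query positions.
    importance = lower_attn.mean(dim=(1, 2))                 # (batch, seq)
    k = max(1, int(keep_ratio * importance.size(-1)))
    keep_idx = importance.topk(k, dim=-1).indices            # (batch, k)
    keep_idx = keep_idx.sort(dim=-1).values                  # preserve temporal order

    # Gather the retained KV entries for every head.
    idx = keep_idx[:, None, :, None].expand(-1, keys.size(1), -1, keys.size(-1))
    return keys.gather(2, idx), values.gather(2, idx)
```

Because only the designated lower layer needs explicit scores, the higher layers themselves can keep running FlashAttention or sparse kernels over the pruned cache, which is the compatibility property the abstract claims.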
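For intuition on what a spatial-temporal sparsity pattern can look like, here is a toy mask in the spirit of ST-SpAttn; the paper's actual pattern may differ. It assumes video tokens are laid out frame-major (position = frame * tokens_per_frame + patch), and lets a query attend to all tokens of its own frame (spatial) plus the same patch index in earlier frames (temporal), suppressing cross-frame, cross-patch interactions. The function name `st_sparse_mask` is a placeholder.

```python
# Toy spatial-temporal sparsity mask (illustrative only, not the paper's kernel).
import torch

def st_sparse_mask(num_frames: int, tokens_per_frame: int) -> torch.Tensor:
    n = num_frames * tokens_per_frame
    pos = torch.arange(n)
    frame, patch = pos // tokens_per_frame, pos % tokens_per_frame

    same_frame = frame[:, None] == frame[None, :]   # spatial: within the same frame
    same_patch = patch[:, None] == patch[None, :]   # temporal: same patch location
    past_frame = frame[:, None] >= frame[None, :]   # only attend to earlier frames
    return same_frame | (same_patch & past_frame)   # (n, n) boolean mask
```

A boolean mask of this shape can be passed as `attn_mask` to `torch.nn.functional.scaled_dot_product_attention` (True = attend), or used to select blocks in a block-sparse kernel for actual prefilling speedups.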