Large Language Models capable of handling extended contexts are in high demand, yet their inference remains challenging due to the substantial size of the Key-Value (KV) cache and the high memory bandwidth it requires. Previous research has demonstrated that the KV cache exhibits low-rank characteristics within the hidden dimension, suggesting the potential for effective compression. However, because of the Rotary Position Embedding (RoPE) mechanism widely adopted in modern LLMs, naive low-rank compression suffers severe accuracy degradation or creates a new speed bottleneck, since the low-rank cache must first be reconstructed before RoPE can be applied. In this paper, we introduce two key insights: first, applying RoPE to the key vectors increases their variance, which in turn raises their rank; second, once the key vectors are transformed into the latent space, their representations remain largely consistent across most layers. Based on these insights, we propose the Sparse Attention in Latent Space (SALS) framework. SALS projects the KV cache into a compact latent space via low-rank projection and performs sparse token selection using RoPE-free query-key interactions in that space. By reconstructing only a small subset of important tokens, it avoids the overhead of full KV cache reconstruction. We comprehensively evaluate SALS on various tasks using two large-scale models, LLaMA2-7b-chat and Mistral-7b, and additionally verify its scalability on the RULER-128k benchmark with LLaMA3.1-8B-Instruct. Experimental results demonstrate that SALS achieves state-of-the-art performance while maintaining competitive accuracy. Under different settings, SALS achieves 6.4-fold KV cache compression and a 5.7-fold speed-up in the attention operator compared to FlashAttention2 on 4K sequences. For end-to-end throughput, it achieves 1.4-fold and 4.5-fold improvements over GPT-fast on 4K and 32K sequences, respectively.
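To make the decoding pipeline concrete, the following is a minimal sketch (not the authors' code) of one SALS-style decode step: keys and values are cached in a low-rank latent space, token selection uses RoPE-free latent query-key scores, and only the top-k selected tokens are reconstructed to full rank (with RoPE applied) before standard attention. All names, shapes, the random projection matrices W_down/W_up (which in the paper would be derived rather than random), the toy rope function, and the top-k rule are illustrative assumptions.

```python
# Hedged sketch of SALS decoding under assumed shapes; not the paper's implementation.
import torch
import torch.nn.functional as F

d, r, T, k = 128, 32, 4096, 256        # head dim, latent rank, cached tokens, tokens kept

# Assumed low-rank projection pair (random placeholders; the paper would derive these).
W_down = torch.randn(d, r) / d ** 0.5  # d -> r, projects into the latent space
W_up = torch.randn(r, d) / r ** 0.5    # r -> d, reconstructs to full rank

latent_k = torch.randn(T, r)           # compressed key cache (what SALS stores)
latent_v = torch.randn(T, r)           # compressed value cache (what SALS stores)

def rope(x, pos):
    """Toy rotary position embedding: rotate feature pairs by position-dependent angles."""
    half = x.shape[-1] // 2
    freqs = 1.0 / (10000 ** (torch.arange(half) / half))
    angles = pos[:, None] * freqs[None, :]
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

def sals_decode_step(q):
    # 1) RoPE-free token selection in the latent space (cheap: rank r << d).
    q_lat = q @ W_down                                 # (r,)
    scores = latent_k @ q_lat                          # (T,) latent query-key scores
    top = scores.topk(k).indices                       # indices of important tokens

    # 2) Reconstruct ONLY the selected tokens to full rank, then apply RoPE to keys.
    k_full = rope(latent_k[top] @ W_up, top.float())   # (k, d)
    v_full = latent_v[top] @ W_up                      # (k, d)

    # 3) Standard attention over the small reconstructed subset.
    q_rope = rope(q[None], torch.tensor([float(T)]))[0]  # query at current position T
    attn = F.softmax((k_full @ q_rope) / d ** 0.5, dim=0)
    return attn @ v_full                               # (d,) attention output

out = sals_decode_step(torch.randn(d))
print(out.shape)  # torch.Size([128])
```

The design point this sketch tries to capture is that the expensive up-projection and RoPE application run over only k of T cached tokens, so the full KV cache is never materialized at decode time.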