Inference with large language models is slow and memory-intensive, and one of its most critical bottlenecks is excessive Key-Value (KV) cache access. This paper introduces "Double Sparsity," a novel post-training sparse attention technique that alleviates this bottleneck by reducing KV cache accesses. Double Sparsity combines token sparsity, which computes self-attention over only the important tokens, with channel sparsity, which uses important feature channels to identify those tokens. Our key insight is that the pattern of channel sparsity is relatively static, so the important channels can be found through offline calibration and exploited efficiently at runtime, enabling accurate and efficient identification of important tokens. Moreover, Double Sparsity can be combined with offloading to significantly reduce memory usage. Experimental results demonstrate that Double Sparsity achieves $\frac{1}{16}$ token and channel sparsity with minimal impact on accuracy across various tasks, including wiki-2 perplexity, key-value retrieval, and long-context benchmarks, on models including Llama-2-7B, Llama-2-70B, and Mixtral-8x7B. It delivers up to a 14.1$\times$ speedup in attention operations and a 1.9$\times$ improvement in end-to-end inference on GPUs. With offloading, it accelerates decoding by 16.3$\times$ over state-of-the-art solutions at a sequence length of 256K. Our code is publicly available at https://github.com/andy-yang-1/DoubleSparse.
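To make the mechanism concrete, the sketch below shows one decoding step for a single attention head in PyTorch. It is a minimal illustration, not the paper's implementation: the function names, the calibration heuristic (scoring channels by their mean contribution to query-key dot products over paired calibration samples), and the single-head layout are our own simplifying assumptions; a real system would calibrate per head and use optimized GPU kernels.

```python
import torch

def calibrate_heavy_channels(Q_cal, K_cal, r):
    """Offline step (hypothetical heuristic): score each feature channel by
    its mean absolute contribution to q-k dot products over paired
    calibration queries/keys of shape (m, d), and keep the top-r channels.
    The returned indices are static and reused at every decoding step."""
    contrib = (Q_cal * K_cal).abs().mean(dim=0)  # (d,) per-channel score
    return torch.topk(contrib, k=r).indices

def double_sparsity_attention(q, K, V, heavy_channels, top_k):
    """Online step for one decoding query q (d,) against cached K, V (n, d).
    Channel sparsity: rank tokens using only the calibrated heavy channels,
    so the ranking pass reads an r/d fraction of each cached key.
    Token sparsity: run full attention over just the top-k ranked tokens."""
    d = q.shape[-1]
    approx_scores = K[:, heavy_channels] @ q[heavy_channels]        # (n,)
    idx = torch.topk(approx_scores, k=min(top_k, K.shape[0])).indices
    scores = (K[idx] @ q) / d ** 0.5                                # (top_k,)
    weights = torch.softmax(scores, dim=-1)
    return weights @ V[idx]                                         # (d,)

# Toy usage at the abstract's 1/16 token and channel sparsity.
d, n = 128, 1024
heavy = calibrate_heavy_channels(torch.randn(512, d), torch.randn(512, d), r=d // 16)
q, K, V = torch.randn(d), torch.randn(n, d), torch.randn(n, d)
out = double_sparsity_attention(q, K, V, heavy, top_k=n // 16)
```

Because the heavy-channel indices are fixed after calibration, the ranking pass touches only a small, predictable slice of the KV cache, which is how the static channel pattern translates into reduced KV cache access at runtime.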