Large language models (LLMs) have driven significant advancements across diverse NLP tasks, with long-context models gaining prominence for handling extended inputs. However, the expanding key-value (KV) cache size required by Transformer architectures intensifies the memory constraints, particularly during the decoding phase, creating a significant bottleneck. Existing sparse attention mechanisms designed to address this bottleneck have two limitations: (1) they often fail to reliably identify the most relevant tokens for attention, and (2) they overlook the spatial coherence of token selection across consecutive Transformer layers, which can lead to performance degradation and substantial overhead in token selection. This paper introduces TidalDecode, a simple yet effective algorithm and system for fast and accurate LLM decoding through position persistent sparse attention. TidalDecode leverages the spatial coherence of tokens selected by existing sparse attention methods and introduces a few token selection layers that perform full attention to identify the tokens with the highest attention scores, while all other layers perform sparse attention with the pre-selected tokens. This design enables TidalDecode to substantially reduce the overhead of token selection for sparse attention without sacrificing the quality of the generated results. Evaluation on a diverse set of LLMs and tasks shows that TidalDecode closely matches the generative performance of full attention methods while reducing the LLM decoding latency by up to 2.1x.
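The core mechanism above can be sketched in a few lines. The following is a minimal, single-head NumPy illustration (not the paper's implementation): layers listed in a hypothetical `selection_layers` set attend over the full KV cache and re-select the top-k token positions by attention score, while all other layers reuse the most recently selected positions — the "position persistent" sparse attention the abstract describes. All function and parameter names here are illustrative assumptions.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def decode_step(q_per_layer, kv_cache, selection_layers, k=4):
    """One decoding step across all layers (illustrative sketch).

    q_per_layer: list of (d,) query vectors, one per layer.
    kv_cache: list of (K, V) pairs, each of shape (n_tokens, d).
    selection_layers: layer indices that run full attention and
        re-select the top-k token positions; all other layers run
        sparse attention over the persisted selection.
    """
    selected = None
    outputs = []
    for layer, (q, (K, V)) in enumerate(zip(q_per_layer, kv_cache)):
        if layer in selection_layers or selected is None:
            # Token selection layer: full attention over the whole cache,
            # then persist the k highest-scoring token positions.
            scores = softmax(K @ q / np.sqrt(len(q)))
            selected = np.argsort(scores)[-k:]
            outputs.append(scores @ V)
        else:
            # Sparse layer: attend only to the pre-selected tokens,
            # avoiding any per-layer selection overhead.
            Ks, Vs = K[selected], V[selected]
            s = softmax(Ks @ q / np.sqrt(len(q)))
            outputs.append(s @ Vs)
    return outputs, selected
```

Because only the few selection layers scan the full cache, the remaining layers read just k entries each, which is where the decoding-latency savings come from.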