Large language models (LLMs) now support extremely long context windows, but the quadratic complexity of vanilla attention leads to prohibitively long Time-to-First-Token (TTFT) latency. Existing approaches that address this complexity require additional pretraining or finetuning and often sacrifice model accuracy. In this paper, we first provide both theoretical and empirical foundations for near-lossless sparse attention. We find that dynamically capturing head-specific sparse patterns at runtime with low overhead is crucial. To this end, we propose SampleAttention, an adaptive, structured, and near-lossless sparse attention mechanism. Leveraging observed significant sparse patterns, SampleAttention attends to a fixed percentage of adjacent tokens to capture local window patterns, and employs a two-stage query-guided key-value filtering approach, which adaptively selects a minimal set of key-values with low overhead, to capture column stripe patterns. Comprehensive evaluations show that SampleAttention can seamlessly replace vanilla attention in off-the-shelf LLMs with nearly no accuracy loss, and reduces TTFT by up to $2.42\times$ compared with FlashAttention.
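To make the two sparse patterns described above concrete, the sketch below builds a per-head attention mask that unions a local window (a fixed percentage of adjacent tokens) with column stripes chosen by query-guided key scoring. This is a minimal illustration, not the authors' implementation: it collapses the two-stage filtering into a single sampling pass, and the function name `build_sample_attention_mask`, the `local_ratio`, `sample_ratio`, and `coverage` parameters, and their default values are all illustrative assumptions.

```python
# Minimal sketch of the sparse-mask construction described in the abstract.
# NOT the authors' implementation: all names and hyperparameter values here
# are illustrative assumptions.
import torch
import torch.nn.functional as F


def build_sample_attention_mask(q, k, local_ratio=0.05, sample_ratio=0.02, coverage=0.95):
    """Return a [seq, seq] boolean mask combining a local window with
    query-guided column stripes (hypothetical parameter values)."""
    seq_len, head_dim = q.shape

    # 1) Local window: each query attends to a fixed percentage of adjacent tokens.
    window = max(1, int(local_ratio * seq_len))
    idx = torch.arange(seq_len)
    local_mask = (idx[:, None] - idx[None, :]).abs() < window

    # 2) Query-guided key filtering: score keys with a small sample of queries,
    #    then keep the smallest set of key columns covering `coverage` of the mass.
    n_sample = max(1, int(sample_ratio * seq_len))
    sampled_q = q[torch.randperm(seq_len)[:n_sample]]          # [n_sample, d]
    scores = sampled_q @ k.T / head_dim ** 0.5                 # [n_sample, seq]
    col_mass = F.softmax(scores, dim=-1).mean(dim=0)           # [seq]

    sorted_mass, order = col_mass.sort(descending=True)
    keep = torch.cumsum(sorted_mass, dim=0) <= coverage
    keep[0] = True                                             # always keep the strongest column
    stripe_cols = order[keep]

    stripe_mask = torch.zeros(seq_len, seq_len, dtype=torch.bool)
    stripe_mask[:, stripe_cols] = True

    # Causal constraint plus the union of both sparse patterns.
    causal = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
    return (local_mask | stripe_mask) & causal


if __name__ == "__main__":
    torch.manual_seed(0)
    q = torch.randn(1024, 64)
    k = torch.randn(1024, 64)
    mask = build_sample_attention_mask(q, k)
    print(f"kept {mask.float().mean():.1%} of all attention entries")
```

In a real deployment the resulting mask would feed a block-sparse attention kernel rather than a dense masked softmax; the point here is only to show how the local-window and column-stripe patterns can be combined per head at runtime.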