We present QUOKA (Query-oriented KV selection for efficient attention), a training-free, hardware-agnostic sparse attention algorithm for accelerating transformer inference under chunked prefill. While most queries attend strongly to only a small group of keys in the attention operator, we observe that queries with low cosine similarity to the mean query interact with more keys and contribute most to the final attention logits. By prioritizing these low-similarity queries, the behavior of full attention during the prefill stage can be closely approximated. QUOKA leverages this observation, accelerating attention by (1) retaining a small set of representative queries and (2) subselecting the keys most aligned with those queries. In experiments on Needle-In-A-Haystack, LongBench, RULER, and Math500, QUOKA achieves near-baseline accuracy while using 88% fewer key-value pairs per attention evaluation, realizing a 3x reduction in time-to-first-token, a 5x attention speedup on Nvidia GPUs, and up to a nearly 7x speedup on Intel Xeon CPUs.
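The two-step selection described above can be illustrated with a minimal sketch. The function name `quoka_select`, the keep ratios, and the use of a per-key max score over the retained queries are assumptions for illustration, not the paper's actual implementation; the abstract specifies only that low-cosine-similarity queries are retained and that keys most aligned with them are subselected (the 12% key-value budget here mirrors the reported 88% reduction).

```python
import numpy as np

def quoka_select(Q, K, q_keep=0.25, kv_keep=0.12):
    """Hypothetical sketch of QUOKA-style query/key subselection.

    Q: (num_queries, d) query matrix for one head.
    K: (num_keys, d) key matrix.
    Returns indices of retained queries and selected keys.
    """
    # Step 1: keep the queries with the LOWEST cosine similarity
    # to the mean query -- the abstract observes these interact
    # with more keys and dominate the final attention logits.
    q_mean = Q.mean(axis=0)
    cos = (Q @ q_mean) / (
        np.linalg.norm(Q, axis=1) * np.linalg.norm(q_mean) + 1e-8
    )
    n_q = max(1, int(q_keep * len(Q)))
    rep_idx = np.argsort(cos)[:n_q]  # ascending: lowest similarity first

    # Step 2: score each key by its strongest alignment with any
    # representative query (max over retained queries is an assumption),
    # then keep only the top fraction of key-value pairs.
    scores = (Q[rep_idx] @ K.T).max(axis=0)
    n_k = max(1, int(kv_keep * len(K)))
    key_idx = np.sort(np.argsort(scores)[-n_k:])
    return rep_idx, key_idx
```

Sparse attention would then be evaluated only over `K[key_idx]` (and the matching values), shrinking the key-value working set per attention evaluation.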