Sparse Attention is a technique that approximates standard attention with sub-quadratic complexity by selectively ignoring smaller entries in the attention matrix during the softmax computation. Variations of this technique, such as KV-cache pruning, sparsity-based fast attention, and the Sparse Transformer, have been widely used for efficient deployment of Large Language Models (LLMs). Despite this widespread use, a theoretical understanding of the conditions under which sparse attention performs on par with exact attention remains elusive. This work aims to $\textbf{bridge this gap by examining the inherent sparsity of standard attention mechanisms}$. Our theoretical framework reveals several new key insights: $\bullet$ Attention is $n^{C}$-sparse: keeping only the largest $\Omega(n^{C})$ of the $n$ entries in each attention row suffices for sparse attention to approximate the exact attention matrix with diminishing error. Here, $n$ is the input length and $C \in (0, 1)$ is a constant. $\bullet$ Stable $o(\log(n))$-sparse attention, which approximates attention with $o(\log(n))$ entries, may not be feasible, since the approximation error remains at least $\Omega(1)$. $\bullet$ An adaptive window size of $\alpha \cdot n^C$ (with $\alpha \in \mathbb{R}$) for efficient attention methods, rather than a fixed one, is guaranteed to be both more accurate and more efficient when performing inference over flexible context lengths.
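The top-$n^C$ approximation and the adaptive window size $\alpha \cdot n^C$ can be sketched numerically. The following is a minimal NumPy illustration, not the paper's implementation; all function names are my own, and the choice $d = 32$, $\alpha = 1$, $C = 0.5$ is arbitrary. For each query, only the $k$ largest score entries are kept and the softmax is renormalized over them.

```python
import numpy as np

def full_attention(q, K, V):
    """Exact single-query attention: softmax(K q / sqrt(d)) @ V."""
    scores = K @ q / np.sqrt(q.shape[-1])
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ V

def topk_sparse_attention(q, K, V, k):
    """Keep only the k largest scores, renormalize the softmax over them."""
    scores = K @ q / np.sqrt(q.shape[-1])
    idx = np.argpartition(scores, -k)[-k:]   # indices of the top-k scores
    w = np.exp(scores[idx] - scores[idx].max())
    w /= w.sum()
    return w @ V[idx]

def adaptive_window(n, alpha=1.0, C=0.5):
    """Adaptive window size alpha * n^C, clamped to [1, n]."""
    return min(n, max(1, int(np.ceil(alpha * n ** C))))

rng = np.random.default_rng(0)
d = 32
for n in (256, 1024, 4096):
    K = rng.standard_normal((n, d))
    V = rng.standard_normal((n, d))
    q = rng.standard_normal(d)
    k = adaptive_window(n)                   # grows like n^0.5 with the context
    err = np.linalg.norm(topk_sparse_attention(q, K, V, k) - full_attention(q, K, V))
    print(f"n={n:5d}  k={k:3d}  l2 error={err:.4f}")
```

With $k = n$ the sparse computation recovers exact attention, and the adaptive window lets $k$ grow with the context length instead of staying fixed.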