We consider the problem of estimating the Attention mechanism in small space and prove the existence of coresets for it of nearly optimal size. Specifically, we show that for any set of unit-norm keys and values $(K,V)$ in $\mathbb{R}^d$, there exists a subset $(K',V')$ of size at most $O(\sqrt{d}\, e^{\rho+o(\rho)}/\varepsilon)$ such that \[ \left\| \operatorname{Attn}(q,K,V)- \operatorname{Attn}(q,K',V') \right\| \le \varepsilon \] simultaneously for all queries $q$ whose norm is bounded by $\rho$. This improves on the best known results for this problem. We also give an improved lower bound showing that $\varepsilon$-coresets must have size $\Omega(\sqrt{d}\, e^{\rho}/\varepsilon)$.
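To make the guarantee concrete, the following minimal sketch evaluates the error $\left\| \operatorname{Attn}(q,K,V)- \operatorname{Attn}(q,K',V') \right\|$ for a single query. It assumes the standard softmax form $\operatorname{Attn}(q,K,V) = \sum_i e^{\langle q,k_i\rangle} v_i / \sum_j e^{\langle q,k_j\rangle}$ with no extra scaling, which the abstract does not spell out. The helper names (`attn`, `unit_vector`, `coreset_error`) are hypothetical, and the subset is chosen uniformly at random purely for illustration; it is not the construction behind the stated bound.

```python
import math
import random

def attn(q, keys, values):
    """Softmax attention for one query (assumed form, no 1/sqrt(d) scaling)."""
    scores = [sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)  # subtract the max score for numerical stability
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(dim)]

def unit_vector(d, rng):
    """Draw a random vector in R^d and normalize it to unit Euclidean norm."""
    x = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(t * t for t in x))
    return [t / norm for t in x]

def coreset_error(q, keys, values, idx):
    """Euclidean distance between full attention and attention on the subset idx."""
    full = attn(q, keys, values)
    sub = attn(q, [keys[i] for i in idx], [values[i] for i in idx])
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(full, sub)))

rng = random.Random(0)
d, n, rho = 8, 200, 1.0                          # illustrative sizes only
keys = [unit_vector(d, rng) for _ in range(n)]   # unit-norm keys
values = [unit_vector(d, rng) for _ in range(n)] # unit-norm values
q = [rho * t for t in unit_vector(d, rng)]       # query with norm exactly rho

# Uniformly random subset: a placeholder, NOT the paper's coreset construction.
idx = rng.sample(range(n), 50)
err = coreset_error(q, keys, values, idx)
```

A coreset in the sense above would need `err <= eps` for every query of norm at most `rho` at once, whereas this sketch checks only one query. Because both outputs are convex combinations of unit-norm values, the error can never exceed 2.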