Recent work has revealed a link between self-attention mechanisms in transformers and test-time kernel regression via the Nadaraya-Watson estimator, with standard softmax attention corresponding to a Gaussian kernel. However, a kernel-theoretic understanding of sparse attention mechanisms is currently missing. In this paper, we establish a formal correspondence between sparse attention and compact (bounded support) kernels. We show that normalized ReLU and sparsemax attention arise from Epanechnikov kernel regression under fixed and adaptive normalizations, respectively. More generally, we demonstrate that widely used kernels in nonparametric density estimation -- including Epanechnikov, biweight, and triweight -- correspond to $\alpha$-entmax attention with $\alpha = 1 + \frac{1}{n}$ for $n \in \mathbb{N}$, while the softmax/Gaussian relationship emerges in the limit $n \to \infty$. This unified perspective explains how sparsity naturally emerges from kernel design and provides principled alternatives to heuristic top-$k$ attention and other associative memory mechanisms. Experiments with a kernel-regression-based variant of transformers -- Memory Mosaics -- show that kernel-based sparse attention achieves competitive performance on language modeling, in-context learning, and length generalization tasks, offering a principled framework for designing attention mechanisms.
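For concreteness, the following minimal NumPy sketch illustrates the Nadaraya-Watson view summarized above: weighting values by a Gaussian kernel over query-key distances yields dense, softmax-like weights, while swapping in the compactly supported Epanechnikov kernel produces weights with exact zeros. The function name \texttt{nw\_attention}, the \texttt{bandwidth} parameter, and the fixed-sum normalization (the normalized-ReLU case) are illustrative assumptions, not the paper's notation.

\begin{verbatim}
import numpy as np

def nw_attention(q, K, V, kernel="gaussian", bandwidth=1.0):
    # Nadaraya-Watson regression read as attention: weight each value row V[i]
    # by a kernel evaluated at the scaled distance between query q and key K[i].
    d2 = np.sum((K - q) ** 2, axis=-1) / bandwidth**2  # squared query-key distances
    if kernel == "gaussian":
        # Strictly positive weights for every key: the dense, softmax-like case.
        w = np.exp(-0.5 * d2)
    elif kernel == "epanechnikov":
        # Compact support: keys farther than the bandwidth get exactly zero weight.
        w = np.maximum(1.0 - d2, 0.0)
    else:
        raise ValueError(f"unknown kernel: {kernel}")
    w = w / w.sum()  # fixed normalization (illustrative; not the adaptive sparsemax case)
    return w, w @ V  # attention weights and the regression estimate

# Toy example: the third key is far from the query, so the Epanechnikov
# weights are sparse while the Gaussian weights stay dense.
q = np.zeros(2)
K = np.array([[0.1, 0.0], [0.5, 0.5], [2.0, 2.0]])
V = np.array([[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]])
print(nw_attention(q, K, V, kernel="gaussian")[0])      # all weights > 0
print(nw_attention(q, K, V, kernel="epanechnikov")[0])  # last weight is exactly 0
\end{verbatim}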