We introduce sliced ReLU attention, a new attention mechanism that departs structurally from both softmax and ReLU-based alternatives. Instead of applying a nonlinearity to pairwise dot products, we operate on one-dimensional projections of key--query differences and exploit sorting to achieve quasi-linear complexity. This construction yields a differentiable, non-symmetric kernel that can be evaluated in $O(n \log n)$ time, making it suitable for very long contexts. Beyond computational benefits, the model retains strong theoretical expressive power: we establish two in-context expressivity results, previously known for softmax attention, showing that sliced ReLU attention preserves the ability to perform nontrivial sequence-to-sequence disentangling tasks and satisfies a contextual universal approximation property. Finally, we illustrate the potential practical interest of this kernel in small-scale experiments.
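The abstract does not spell out the algorithm, so the following is only a minimal NumPy sketch of the kind of sorting trick it alludes to: a single slice direction, an unnormalized kernel of the form ReLU(theta·q_i - theta·k_j), and a prefix-sum evaluation after sorting the projected keys. All names (`sliced_relu_attention_1slice`, `theta`) are illustrative assumptions, not the paper's actual construction, which may use multiple slices, normalization, or causal masking.

```python
import numpy as np


def sliced_relu_attention_1slice(Q, K, V, theta):
    """Hypothetical single-slice aggregation in O(n log n).

    Q, K: (n, d) queries and keys; V: (n, d_v) values; theta: (d,) slice direction.
    Returns an (n, d_v) array whose i-th row is sum_j ReLU(q_i - k_j) * V[j],
    with q_i = theta·Q[i] and k_j = theta·K[j].
    """
    q = Q @ theta                      # (n,) projected queries
    k = K @ theta                      # (n,) projected keys

    # Sort the projected keys once (O(n log n)).
    order = np.argsort(k)
    k_sorted = k[order]
    V_sorted = V[order]

    # Prefix sums of V_j and k_j * V_j over keys below a threshold, so that
    # sum_{k_j <= q_i} (q_i - k_j) V_j = q_i * cum_V - cum_kV.
    cum_V = np.cumsum(V_sorted, axis=0)                       # (n, d_v)
    cum_kV = np.cumsum(k_sorted[:, None] * V_sorted, axis=0)  # (n, d_v)

    # For each query, count keys with k_j <= q_i (binary search, O(log n) each).
    idx = np.searchsorted(k_sorted, q, side="right")          # (n,)

    out = np.zeros((Q.shape[0], V.shape[1]))
    nz = idx > 0
    out[nz] = q[nz, None] * cum_V[idx[nz] - 1] - cum_kV[idx[nz] - 1]
    return out


def naive_sliced_relu(Q, K, V, theta):
    """O(n^2) reference with the explicit ReLU kernel matrix, for checking."""
    q, k = Q @ theta, K @ theta
    W = np.maximum(q[:, None] - k[None, :], 0.0)  # (n, n) kernel weights
    return W @ V


# Sanity check that the sorted O(n log n) evaluation matches the naive kernel.
rng = np.random.default_rng(0)
n, d, d_v = 256, 16, 8
Q, K, V = rng.normal(size=(n, d)), rng.normal(size=(n, d)), rng.normal(size=(n, d_v))
theta = rng.normal(size=d)
theta /= np.linalg.norm(theta)
assert np.allclose(sliced_relu_attention_1slice(Q, K, V, theta),
                   naive_sliced_relu(Q, K, V, theta))
```

The quasi-linear cost comes from replacing the explicit n-by-n kernel matrix with one sort of the projected keys plus prefix sums, exploiting that ReLU applied to a one-dimensional difference is piecewise linear in the threshold.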