Linear attention methods offer Transformers $O(N)$ complexity but typically underperform standard softmax attention. We identify two fundamental limitations of these approaches: the restriction to convex combinations, which permits only additive information blending, and a uniform accumulated-weight bias that dilutes attention over long contexts. We propose Zero-Sum Linear Attention (ZeroS), which addresses these limitations by removing the constant zero-order term $1/t$ and reweighting the remaining zero-sum softmax residuals. This modification yields mathematically stable weights that can take both positive and negative values, allowing a single attention layer to perform contrastive operations. While maintaining $O(N)$ complexity, ZeroS provably expands the set of representable functions beyond convex combinations. Empirically, it matches or exceeds standard softmax attention across various sequence modeling benchmarks.
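The core manipulation described above can be made concrete with a minimal sketch. The snippet below is an illustrative, quadratic-time rendering of the zero-sum residual idea for a single query, not the paper's $O(N)$ kernelized implementation; the function name `zeros_attention_step` and the scalar `gamma` (standing in for whatever reweighting the method learns) are assumptions for illustration only.

```python
import math

def zeros_attention_step(scores, values, gamma=1.0):
    """Illustrative sketch: zero-sum residual attention for one query.

    scores: list of t logits over positions 1..t
    values: list of t value vectors (each a list of floats)
    gamma:  hypothetical reweighting scalar (learned in practice)
    """
    t = len(scores)
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    a = [e / z for e in exps]          # standard softmax weights: sum to 1 (convex)
    r = [ai - 1.0 / t for ai in a]     # drop the 1/t zero-order term: residuals sum to 0
    w = [gamma * ri for ri in r]       # reweight the zero-sum residuals
    # Weights can now be negative, so value vectors may be subtracted,
    # a contrastive operation a convex combination cannot express.
    d = len(values[0])
    return [sum(w[i] * values[i][j] for i in range(t)) for j in range(d)]
```

Because the residuals sum to zero by construction, any position with below-average softmax weight receives a negative coefficient, which is what enables subtractive (contrastive) mixing within a single layer.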