Exact Linear Attention

This paper introduces Exact Linear Attention (ELA), a mechanism that achieves linear computational complexity for Transformer attention by exploiting the exact decomposition property of kernel functions, thereby eliminating approximation error. We identify and address two key limitations of prior linear attention, gradient explosion and token attention dilution, by imposing kernel constraints that ensure non-negativity, discriminability, and geometric interpretability. We propose several kernel functions, including the Hadamard Exp Kernel, the Summation Squared Euclidean Distance Kernel, and the Subtraction Squared Euclidean Distance Kernel, each tailored to a specific attention behavior. Beyond the core attention formulation, the paper presents three engineering innovations: (1) a Hyper-Link structure that replaces traditional residual connections to mitigate gradient degradation; (2) a Memory Lobe module based on bidirectional linear attention, which captures the "transformation flow" across layers to implement qualitative memory and an implicit reinforcement learning paradigm; and (3) a routing-score-based bias mechanism for Mixture-of-Experts (MoE) that improves interpretability and semantic alignment. Experimental results show that, compared to full attention, ELA achieves up to 6x faster decoding and a 75% reduction in KV cache memory usage while maintaining comparable or superior training performance. The proposed memory module accelerates convergence and enhances generalization. Furthermore, we extend the linear attention principle to vision models, yielding YOLO-LAT, which attains up to a 4.3x GPU inference speedup and a 7.9x reduction in parameter count with competitive detection accuracy. These results underline the broad applicability of exact linear attention for scaling Transformer models to ultra-long sequences and for efficient visual tasks.
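To make the idea of an exactly decomposable kernel concrete, here is a minimal sketch in plain Python. It assumes a hypothetical kernel k(q, k) = phi(q) · phi(k) with an elementwise-exponential feature map `phi`; this choice is only an illustration of a non-negative, exactly factorizable kernel and is not claimed to match any of the kernels proposed in the paper. The function names and toy inputs are likewise assumptions. The sketch compares a quadratic-time computation against a linear-time one that first summarizes all keys and values, and checks that the two agree up to floating-point rounding.

```python
import math

def phi(x):
    # Hypothetical positive feature map (assumption, not the paper's kernel):
    # elementwise exp, so phi(q) . phi(k) is always non-negative.
    return [math.exp(v) for v in x]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def quadratic_attention(Q, K, V):
    """O(n^2): score every query against every key with k(q, k) = phi(q) . phi(k)."""
    out = []
    for q in Q:
        weights = [dot(phi(q), phi(k)) for k in K]
        total = sum(weights)
        out.append([sum(w * v[j] for w, v in zip(weights, V)) / total
                    for j in range(len(V[0]))])
    return out

def linear_attention(Q, K, V):
    """O(n): build one key/value summary, then reuse it for every query."""
    d, dv = len(K[0]), len(V[0])
    S = [[0.0] * dv for _ in range(d)]  # S = sum over keys of phi(k) v^T
    z = [0.0] * d                       # z = sum over keys of phi(k)
    for k, v in zip(K, V):
        fk = phi(k)
        for i in range(d):
            z[i] += fk[i]
            for j in range(dv):
                S[i][j] += fk[i] * v[j]
    out = []
    for q in Q:
        fq = phi(q)
        denom = dot(fq, z)              # normalizer: phi(q) . z
        out.append([sum(fq[i] * S[i][j] for i in range(d)) / denom
                    for j in range(dv)])
    return out

# Toy inputs (illustrative values only).
Q = [[0.1, -0.2], [0.3, 0.5], [-0.4, 0.0]]
K = [[0.2, 0.1], [-0.3, 0.4], [0.0, -0.1]]
V = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

slow = quadratic_attention(Q, K, V)
fast = linear_attention(Q, K, V)
max_diff = max(abs(a - b) for ra, rb in zip(slow, fast) for a, b in zip(ra, rb))
```

Because the kernel factorizes exactly, the reordering of the sums introduces no approximation; any difference between `slow` and `fast` comes only from floating-point arithmetic. The sketch is non-causal. In a causal decoder, `S` and `z` would plausibly be kept as a running state updated once per token, which is the usual reason linear attention can avoid a key/value cache that grows with sequence length.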
