Transformer-based architectures have become the prevailing backbone of large language models. However, the quadratic time and memory complexity of self-attention remains a fundamental obstacle to efficient long-context modeling. To address this limitation, recent research has introduced two principal categories of efficient attention mechanisms. Linear attention methods achieve linear complexity through kernel approximations, recurrent formulations, or fast-weight dynamics, thereby enabling scalable inference with reduced computational overhead. Sparse attention techniques, in contrast, restrict attention computation to selected subsets of tokens based on fixed patterns, block-wise routing, or clustering strategies, improving efficiency while preserving contextual coverage. This survey provides a systematic and comprehensive overview of these developments, integrating both algorithmic innovations and hardware-level considerations. In addition, we analyze the incorporation of efficient attention into large-scale pre-trained language models, including both architectures built entirely on efficient attention and hybrid designs that combine local and global components. By aligning theoretical foundations with practical deployment strategies, this work aims to serve as a foundational reference for advancing the design of scalable and efficient language models.
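To make the linear-attention category concrete, the following is a minimal sketch of the kernel-trick reformulation it relies on, using the elu(x)+1 feature map popularized by Katharopoulos et al. (2020). The function names, shapes, and the choice of feature map are illustrative assumptions for exposition, not an implementation of any specific method surveyed here.

```python
import numpy as np

def elu_feature_map(x):
    # elu(x) + 1: a positive feature map so attention weights stay non-negative.
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention(Q, K, V):
    """Non-causal linearized attention.

    Replaces softmax(Q K^T) V, which costs O(n^2 d), with
    phi(Q) (phi(K)^T V), which costs O(n d^2).
    Q, K: (n, d); V: (n, d_v).
    """
    Qp, Kp = elu_feature_map(Q), elu_feature_map(K)
    KV = Kp.T @ V                    # (d, d_v) summary of all keys and values
    Z = Qp @ Kp.sum(axis=0)          # (n,) per-query normalizer
    return (Qp @ KV) / Z[:, None]

def causal_linear_attention(Q, K, V):
    """Causal variant, written as a recurrence over a fixed-size state."""
    n, d = Q.shape
    Qp, Kp = elu_feature_map(Q), elu_feature_map(K)
    S = np.zeros((d, V.shape[1]))    # running sum of phi(k_j) v_j^T
    z = np.zeros(d)                  # running sum of phi(k_j)
    out = np.empty_like(V)
    for i in range(n):
        S += np.outer(Kp[i], V[i])   # state update: constant memory per step
        z += Kp[i]
        out[i] = (Qp[i] @ S) / (Qp[i] @ z)
    return out

# Usage with random inputs (illustrative only).
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((128, 64)) for _ in range(3))
O = linear_attention(Q, K, V)        # shape (128, 64)
```

The causal variant makes the connection to recurrent formulations explicit: the entire prefix is summarized in a fixed-size state (S, z), which is what yields linear time in sequence length and constant memory per decoding step.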