Transformer-based architectures have become the prevailing backbone of large language models. However, the quadratic time and memory complexity of self-attention remains a fundamental obstacle to efficient long-context modeling. To address this limitation, recent research has introduced two principal categories of efficient attention mechanisms. Linear attention methods achieve linear complexity through kernel approximations, recurrent formulations, or fast-weight dynamics, thereby enabling scalable inference with reduced computational overhead. Sparse attention techniques, in contrast, restrict attention computation to selected subsets of tokens based on fixed patterns, block-wise routing, or clustering strategies, improving efficiency while preserving contextual coverage. This survey provides a systematic and comprehensive overview of these developments, integrating both algorithmic innovations and hardware-level considerations. In addition, we analyze the incorporation of efficient attention into large-scale pre-trained language models, including both architectures built entirely on efficient attention and hybrid designs that combine local and global components. By aligning theoretical foundations with practical deployment strategies, this work aims to serve as a foundational reference for advancing the design of scalable and efficient language models.
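To make the linear-attention category concrete, the following is a minimal sketch of the kernel-trick reformulation it relies on, using the elu(x)+1 feature map popularized by Katharopoulos et al. (2020). The function names, shapes, and the choice of feature map are illustrative assumptions for exposition, not an implementation of any specific method surveyed here.

```python
import numpy as np

def elu_feature_map(x):
    # elu(x) + 1: a positive feature map so attention weights stay non-negative.
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention(Q, K, V):
    """Non-causal linearized attention.

    Replaces softmax(Q K^T) V, which costs O(n^2 d), with
    phi(Q) (phi(K)^T V), which costs O(n d^2).
    Q, K: (n, d); V: (n, d_v).
    """
    Qp, Kp = elu_feature_map(Q), elu_feature_map(K)
    KV = Kp.T @ V                    # (d, d_v) summary of all keys and values
    Z = Qp @ Kp.sum(axis=0)          # (n,) per-query normalizer
    return (Qp @ KV) / Z[:, None]

def causal_linear_attention(Q, K, V):
    """Causal variant, written as a recurrence over a fixed-size state."""
    n, d = Q.shape
    Qp, Kp = elu_feature_map(Q), elu_feature_map(K)
    S = np.zeros((d, V.shape[1]))    # running sum of phi(k_j) v_j^T
    z = np.zeros(d)                  # running sum of phi(k_j)
    out = np.empty_like(V)
    for i in range(n):
        S += np.outer(Kp[i], V[i])   # state update: constant memory per step
        z += Kp[i]
        out[i] = (Qp[i] @ S) / (Qp[i] @ z)
    return out

# Usage with random inputs (illustrative only).
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((128, 64)) for _ in range(3))
O = linear_attention(Q, K, V)        # shape (128, 64)
```

The causal variant makes the connection to recurrent formulations explicit: the entire prefix is summarized in a fixed-size state (S, z), which is what yields linear time in sequence length and constant memory per decoding step.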