Transformers serve as the foundation of most modern large language models. To mitigate the quadratic complexity of standard full attention, various efficient attention mechanisms, such as linear and hybrid attention, have been developed. Yet a fundamental gap remains: their expressive power relative to full attention lacks a rigorous theoretical characterization. In this work, we theoretically characterize the performance differences among these attention mechanisms. Our theory applies to all linear attention variants that can be formulated as a recurrence, including Mamba and DeltaNet. Specifically, we establish an expressiveness hierarchy: for sequential function composition, a multi-step reasoning task that must be carried out within a model's forward pass, an ($L+1$)-layer full attention network suffices, whereas any hybrid network interleaving $L-1$ layers of full attention with a substantially larger number ($2^{3L^2}$) of linear attention layers cannot solve it. This result demonstrates a clear separation in expressive power between the two types of attention. Our work provides the first provable separation between hybrid attention and standard full attention, offering a theoretical perspective on the fundamental capabilities and limitations of different attention mechanisms.
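To make concrete what "linear attention formulated as a recurrence" means, the sketch below implements causal linear attention as a sequential state update: a fixed-size matrix state accumulates outer products of values and (feature-mapped) keys, so the whole sequence is processed in $O(n)$ time with constant-size state. This is a generic illustration of the recurrence form the theory covers (the elu+1 feature map and the function names are our assumptions, not the paper's construction):

```python
import numpy as np

def feature_map(x):
    # A common positive feature map, elu(x) + 1 (an assumed choice here,
    # not one specified by the paper).
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention_recurrent(Q, K, V, eps=1e-6):
    """Causal linear attention computed as a recurrence.

    Q, K, V: arrays of shape (seq_len, d).
    State S accumulates v_t k_t^T; vector z accumulates k_t for
    normalization, so each step costs O(d^2) regardless of position.
    """
    Qf, Kf = feature_map(Q), feature_map(K)
    n, d = Q.shape
    S = np.zeros((d, d))          # running state: sum over t of v_t k_t^T
    z = np.zeros(d)               # running normalizer: sum over t of k_t
    out = np.zeros_like(V, dtype=float)
    for t in range(n):
        S += np.outer(V[t], Kf[t])        # the recurrence: state update
        z += Kf[t]
        out[t] = (S @ Qf[t]) / (z @ Qf[t] + eps)
    return out
```

Unrolling the recurrence recovers the usual attention-weighted average, since $S q_t = \sum_{s \le t} v_s (k_s^\top q_t)$; hybrid models in the paper's sense interleave layers of this form with standard softmax attention layers.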