Characterizing the expressive power of the Transformer architecture is critical to understanding its capacity limits and scaling laws. Recent works have provided circuit complexity bounds for Transformer-like architectures. Meanwhile, Rotary Position Embedding ($\mathsf{RoPE}$) has emerged as a crucial technique in modern large language models, offering superior performance in capturing positional information compared to traditional position embeddings and showing great promise in applications, particularly in long-context scenarios. Empirical evidence also suggests that $\mathsf{RoPE}$-based Transformer architectures generalize better than conventional Transformer models. In this work, we establish a circuit complexity bound for Transformers with $\mathsf{RoPE}$ attention. Our key contribution is to show that, unless $\mathsf{TC}^0 = \mathsf{NC}^1$, a $\mathsf{RoPE}$-based Transformer with $\mathrm{poly}(n)$ precision, $O(1)$ layers, and hidden dimension $d \leq O(n)$ cannot solve the arithmetic formula evaluation problem or the Boolean formula value problem. This result reveals a fundamental limitation on the expressivity of the $\mathsf{RoPE}$-based Transformer architecture, despite its remarkable empirical success. Our theoretical result not only establishes the complexity bound but may also guide future work on $\mathsf{RoPE}$-based Transformers.
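To make the Boolean formula value problem concrete: it asks, given a fully parenthesized Boolean formula over constants and the connectives $\wedge$, $\vee$, $\neg$, whether the formula evaluates to true; the problem is $\mathsf{NC}^1$-complete, which is why it serves as the separating task above. The following is a minimal sketch of an evaluator, assuming a hypothetical string encoding with `0`/`1` constants, `!` for negation, and parenthesized `&`/`|` for the binary connectives (the paper's formal encoding may differ):

```python
def eval_formula(s: str) -> bool:
    """Evaluate a fully parenthesized Boolean formula, e.g. "((1&0)|!0)".

    Hypothetical grammar: F -> '0' | '1' | '!' F | '(' F ('&'|'|') F ')'.
    """
    pos = 0

    def parse() -> bool:
        nonlocal pos
        c = s[pos]
        if c == '0':                  # constant false
            pos += 1
            return False
        if c == '1':                  # constant true
            pos += 1
            return True
        if c == '!':                  # negation
            pos += 1
            return not parse()
        assert c == '(', f"unexpected character {c!r} at position {pos}"
        pos += 1                      # consume '('
        left = parse()
        op = s[pos]                   # '&' or '|'
        pos += 1
        right = parse()
        assert s[pos] == ')', "missing closing parenthesis"
        pos += 1                      # consume ')'
        return (left and right) if op == '&' else (left or right)

    result = parse()
    assert pos == len(s), "trailing input after formula"
    return result
```

This sequential recursive evaluator runs in linear time, but the point of the complexity-theoretic statement is about *parallel* depth: the problem is complete for $\mathsf{NC}^1$, so a constant-depth $\mathsf{TC}^0$ architecture cannot solve it unless $\mathsf{TC}^0 = \mathsf{NC}^1$.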