Characterizing the expressive power of the Transformer architecture is critical to understanding its capacity limits and scaling behavior. Recent work has established circuit complexity bounds for Transformer-like architectures. Meanwhile, Rotary Position Embedding ($\mathsf{RoPE}$) has emerged as a crucial technique in modern large language models, capturing positional information more effectively than traditional position embeddings and showing great promise in applications, particularly in long-context scenarios. Empirical evidence also suggests that $\mathsf{RoPE}$-based Transformer architectures generalize better than conventional Transformer models. In this work, we establish a tighter circuit complexity bound for Transformers with $\mathsf{RoPE}$ attention. Our key contribution is to show that, unless $\mathsf{TC}^0 = \mathsf{NC}^1$, a $\mathsf{RoPE}$-based Transformer with $\mathrm{poly}(n)$ precision, $O(1)$ layers, and hidden dimension $d \leq O(n)$ cannot solve the arithmetic problem or the Boolean formula value problem. This result demonstrates a fundamental limitation on the expressivity of the $\mathsf{RoPE}$-based Transformer architecture, despite its great empirical success. Our theoretical framework not only establishes tighter complexity bounds but may also guide future work on $\mathsf{RoPE}$-based Transformers.
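For concreteness, the following is a minimal sketch of the standard $\mathsf{RoPE}$ attention score in the formulation of Su et al.; the notation ($W_Q$, $W_K$, $\theta_i$, $R_{\Theta,m}$) is illustrative and not necessarily the exact formalization analyzed in this work. Queries and keys are rotated by position-dependent block-diagonal rotation matrices, so that the inner product depends only on the relative offset $n - m$:
\[
  q_m = R_{\Theta, m} W_Q x_m, \qquad
  k_n = R_{\Theta, n} W_K x_n, \qquad
  \langle q_m, k_n \rangle = x_m^{\top} W_Q^{\top} R_{\Theta, n-m} W_K x_n,
\]
\[
  R_{\Theta, m} = \mathrm{diag}\bigl(R(m\theta_1), \ldots, R(m\theta_{d/2})\bigr), \qquad
  R(\alpha) = \begin{pmatrix} \cos\alpha & -\sin\alpha \\ \sin\alpha & \cos\alpha \end{pmatrix}, \qquad
  \theta_i = 10000^{-2(i-1)/d}.
\]
The identity $R_{\Theta,m}^{\top} R_{\Theta,n} = R_{\Theta,n-m}$ follows from the composition of planar rotations and is what makes the score a function of relative position.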