Deep neural networks are widely believed to derive their expressive power from their ability to form \textbf{hierarchical representations}, capturing progressively more abstract and compositional features across layers. In language modeling, \textbf{transformers} have emerged as the dominant architecture, with early layers capturing local syntactic patterns and later layers encoding more complex clause-level dependencies. While this intuition has shaped model design, there remains a lack of rigorous theoretical work demonstrating \textbf{how} deep transformers represent such hierarchical structures. In this work, we analyze the expressiveness of deep transformer models through the formal lens of bounded-depth, non-recursive context-free grammars. For this class of grammars, we explicitly construct transformers with positional attention whose depth grows linearly with grammar depth, while the neuron count scales with the number of derivation-tree shapes and quadratically with the number of production rules. Our theoretical results support the linear representation hypothesis by demonstrating that these architectures possess the structural capacity to encode abstract grammatical states into low-dimensional, linearly separable subspaces within the residual stream.
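To make the grammar class concrete, the following is a minimal sketch of a check for whether a context-free grammar is non-recursive, which also gives the bound on derivation-tree depth. The toy grammar, the function name `grammar_depth`, and the convention of counting depth as the number of edges from the root to the deepest terminal are illustrative assumptions, not definitions taken from the paper.

```python
# Hypothetical toy grammar (not from the paper): each nonterminal maps to a
# list of productions, and each production is a tuple of symbols.
# Any symbol that is not a key of the dict is treated as a terminal.
GRAMMAR = {
    "S": [("NP", "VP")],
    "NP": [("Det", "N")],
    "VP": [("V", "NP"), ("V",)],
    "Det": [("the",), ("a",)],
    "N": [("cat",), ("dog",)],
    "V": [("sees",), ("sleeps",)],
}


def grammar_depth(grammar, start="S"):
    """Return the maximum derivation-tree depth reachable from `start`.

    Depth is counted here as edges from the root to the deepest terminal
    (an assumed convention). Raises ValueError if a nonterminal can derive
    itself, i.e. if the grammar is recursive and depth is unbounded.
    """
    on_path = set()  # nonterminals on the current expansion path
    cache = {}       # memoized depth per nonterminal

    def depth(symbol):
        if symbol not in grammar:
            return 0  # terminals add no further depth
        if symbol in cache:
            return cache[symbol]
        if symbol in on_path:
            raise ValueError(f"recursive nonterminal: {symbol}")
        on_path.add(symbol)
        deepest_child = max(
            (depth(s) for rhs in grammar[symbol] for s in rhs),
            default=0,  # covers a nonterminal with only empty productions
        )
        on_path.discard(symbol)
        cache[symbol] = 1 + deepest_child
        return cache[symbol]

    return depth(start)


if __name__ == "__main__":
    print(grammar_depth(GRAMMAR))
```

In this toy grammar the longest chain is S, VP, NP, Det, terminal. A rule such as `S -> a S` would make the check raise instead, since the grammar would then admit arbitrarily deep derivations.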