Transformer networks have achieved remarkable empirical success across a wide range of applications, yet their theoretical expressive power remains insufficiently understood. In this paper, we study the expressive capabilities of Transformer architectures. We first construct an explicit approximation of maxout networks by Transformer networks of comparable model complexity. As a consequence, Transformers inherit the universal approximation capability of ReLU networks under similar complexity constraints. Building on this connection, we develop a framework for analyzing the approximation of continuous piecewise linear functions by Transformers and quantitatively characterize their expressivity through the number of linear regions, which grows exponentially with depth. Our analysis establishes a theoretical bridge between the approximation theory of standard feedforward neural networks and Transformer architectures. It also yields structural insights into Transformers: self-attention layers implement max-type operations, while feedforward layers realize token-wise affine transformations.
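To make the maxout-to-ReLU step concrete (the notation below is a generic sketch, not the paper's own construction): a maxout unit of rank $K$ computes
\[
g(x) \;=\; \max_{1 \le k \le K} \bigl( w_k^{\top} x + b_k \bigr),
\]
and choosing $K = 2$ with $(w_1, b_1) = (w, b)$ and $(w_2, b_2) = (0, 0)$ recovers the ReLU unit $\max(w^{\top} x + b,\, 0)$ exactly. Hence every ReLU network can be written as a maxout network of the same depth and comparable size, which is why an approximation of maxout networks by Transformers transfers the universal approximation guarantees of ReLU networks to Transformer architectures.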