The tremendous success of Transformer models in fields such as large language models and computer vision necessitates a rigorous theoretical investigation. To the best of our knowledge, this paper is the first to prove that standard Transformers can approximate Hölder functions in $C^{s,\lambda}\left([0,1]^{d\times n}\right)$ (with $s\in\mathbb{N}_{\geq 0}$ and $0<\lambda\leq 1$) to arbitrary precision under the $L^t$ distance ($t \in [1, \infty]$). Building upon this approximation result, we show that standard Transformers achieve the minimax optimal rate in nonparametric regression for Hölder target functions. Notably, by introducing two metrics, the size tuple and the dimension vector, we provide a fine-grained characterization of Transformer architectures, which facilitates future research on the generalization and optimization errors of Transformers with different structures. As intermediate results, we also derive upper bounds on the Lipschitz constant of standard Transformers and on their memorization capacity, which may be of independent interest. These findings provide theoretical justification for the powerful capabilities of Transformer models.
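For reference, the Hölder class above admits the following standard definition (the norm convention here is ours and may differ slightly from the paper's). Writing $\alpha$ for a multi-index over the $dn$ input coordinates,
\[
C^{s,\lambda}\!\left([0,1]^{d\times n}\right)=\left\{f:[0,1]^{d\times n}\to\mathbb{R}\;\middle|\;\|f\|_{C^{s,\lambda}}<\infty\right\},
\qquad
\|f\|_{C^{s,\lambda}}=\max_{|\alpha|\leq s}\,\sup_{x}\left|\partial^{\alpha}f(x)\right|+\max_{|\alpha|=s}\,\sup_{x\neq y}\frac{\left|\partial^{\alpha}f(x)-\partial^{\alpha}f(y)\right|}{\|x-y\|_{\infty}^{\lambda}}.
\]
Under this smoothness, the classical minimax rate for nonparametric regression with $N$ i.i.d. samples ($N$ is our notation, not the abstract's) is, up to logarithmic factors (Stone, 1982),
\[
\inf_{\hat{f}}\;\sup_{f\in C^{s,\lambda}}\;\mathbb{E}\left\|\hat{f}-f\right\|_{L^2}^{2}\;\asymp\;N^{-\frac{2(s+\lambda)}{2(s+\lambda)+dn}},
\]
which is presumably the benchmark meant by "minimax optimal" above.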