Understanding why Transformers perform so well remains challenging due to their non-convex optimization landscape. In this work, we analyze a shallow Transformer with $m$ independent heads trained by projected gradient descent in the kernel regime. Our analysis reveals two main findings: (i) the width required for non-asymptotic guarantees scales only logarithmically with the sample size $n$, and (ii) the optimization error is independent of the sequence length $T$. This contrasts sharply with recurrent architectures, where the optimization error can grow exponentially with $T$. The trade-off is memory: to retain the full context, the Transformer's memory footprint grows with the sequence length $T$. We validate our theoretical results numerically in a teacher-student setting and confirm the predicted scaling laws for Transformers.
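To make the setting concrete, the following is a minimal sketch of a shallow Transformer layer with $m$ independent attention heads, of the kind the abstract describes. All names (`shallow_attention`, the dimensions `T`, `d`, `d_head`) and the random-weight setup are illustrative assumptions, not the paper's actual implementation or training procedure.

```python
import numpy as np

def softmax(z, axis=-1):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def shallow_attention(X, heads):
    """One forward pass of a single-layer, multi-head attention block.

    X     : (T, d) input sequence of T tokens in d dimensions.
    heads : list of m tuples (W_q, W_k, W_v), each of shape (d, d_head).
            Heads act independently; their outputs are concatenated.
    """
    outputs = []
    for W_q, W_k, W_v in heads:
        Q, K, V = X @ W_q, X @ W_k, X @ W_v
        A = softmax(Q @ K.T / np.sqrt(K.shape[1]))  # (T, T) attention weights
        outputs.append(A @ V)                        # (T, d_head) head output
    return np.concatenate(outputs, axis=-1)          # (T, m * d_head)

# Tiny teacher-student-style demo: random weights stand in for a teacher.
rng = np.random.default_rng(0)
T, d, m, d_head = 8, 16, 4, 16
heads = [tuple(rng.standard_normal((d, d_head)) / np.sqrt(d) for _ in range(3))
         for _ in range(m)]
X = rng.standard_normal((T, d))
out = shallow_attention(X, heads)
print(out.shape)  # (T, m * d_head) = (8, 64)
```

Note that the output shape depends on $T$ only through the number of rows, while the per-token computation aggregates over the entire context at once; this is the structural reason the memory cost grows with the sequence length even as each token attends to the full history in a single step.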