Transformers have achieved extraordinary success in modern machine learning owing to their ability to handle sequential data, especially in next-token prediction (NTP) tasks. However, the theoretical understanding of their performance in NTP remains limited, with existing studies focusing mainly on asymptotic behavior. This paper provides a fine-grained non-asymptotic analysis of the training dynamics of a one-layer transformer consisting of a self-attention module followed by a feed-forward layer. We first characterize the essential structural properties of training datasets for NTP using a mathematical framework based on partial orders. We then design a two-stage training algorithm, in which both the pre-processing stage for training the feed-forward layer and the main stage for training the attention layer exhibit fast convergence. Specifically, both layers converge sub-linearly in direction to their corresponding max-margin solutions, and the cross-entropy loss enjoys a linear convergence rate. Furthermore, we show that the trained transformer retains non-trivial prediction ability under dataset shift, which sheds light on the remarkable generalization performance of transformers. Our analysis technique involves establishing novel properties of the attention gradient and an in-depth analysis of how these properties drive the convergence of the training process. Our experiments further validate our theoretical findings.