Transformers have achieved state-of-the-art performance in language modeling tasks. However, the reasons behind their success remain unclear. To better understand them, we train a Transformer model on a simple next-token prediction task, where sequences are generated by a first-order autoregressive process $s_{t+1} = W s_t$. We show how a trained Transformer predicts the next token by first learning $W$ in-context, then applying a prediction mapping. We call the resulting procedure in-context autoregressive learning. More precisely, focusing on commuting orthogonal matrices $W$, we first show that, when tokens are augmented, a trained one-layer linear Transformer implements one step of gradient descent for the minimization of an inner objective function. When the tokens are not augmented, we characterize the global minima of a one-layer diagonal linear multi-head Transformer. Importantly, we exhibit orthogonality between heads and show that positional encoding captures trigonometric relations in the data. On the experimental side, we consider the general case of non-commuting orthogonal matrices and generalize our theoretical findings.
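To make the data model concrete, the following is a minimal NumPy sketch of the first-order autoregressive process $s_{t+1} = W s_t$ and of the two-stage idea of estimating $W$ from the context before predicting. It is an illustration under stated assumptions, not the Transformer analyzed in the paper: commuting orthogonal matrices are built here as block-diagonal 2x2 rotations (one convenient family), and the in-context estimate uses ordinary least squares as a stand-in for the learned mechanism. All function names are hypothetical.

```python
import numpy as np

def rotation_block(theta):
    """2x2 rotation matrix by angle theta."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

def make_commuting_orthogonal(thetas):
    """Block-diagonal matrix of 2x2 rotations.

    Assumption: this is one convenient family of orthogonal matrices
    that commute with each other; it is chosen for illustration only.
    """
    d = 2 * len(thetas)
    W = np.zeros((d, d))
    for i, theta in enumerate(thetas):
        W[2 * i:2 * i + 2, 2 * i:2 * i + 2] = rotation_block(theta)
    return W

def generate_sequence(W, s0, T):
    """Return the T tokens s_0, ..., s_{T-1} with s_{t+1} = W s_t."""
    seq = [s0]
    for _ in range(T - 1):
        seq.append(W @ seq[-1])
    return np.stack(seq)  # shape (T, d)

def predict_next(seq):
    """Estimate W from the context, then apply it to the last token.

    Least squares stands in for the in-context learning step; it is
    not claimed to be what the trained Transformer computes.
    """
    X, Y = seq[:-1], seq[1:]                      # rows satisfy Y = X @ W.T
    Wt_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)
    W_hat = Wt_hat.T
    return W_hat @ seq[-1], W_hat

rng = np.random.default_rng(0)
W = make_commuting_orthogonal([0.3, 1.1])
s0 = rng.standard_normal(4)
seq = generate_sequence(W, s0, T=8)
pred, W_hat = predict_next(seq)
true_next = W @ seq[-1]
print("prediction error:", np.linalg.norm(pred - true_next))
```

Because the sequence is noiseless and the context here is longer than the token dimension, the least-squares estimate can recover $W$ on the span of the context tokens, so the predicted token matches $W s_T$ up to numerical precision.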