Autoregressively trained transformers have brought a profound revolution to the world, especially with their in-context learning (ICL) ability to address downstream tasks. Recently, several studies have suggested that transformers learn a mesa-optimizer during autoregressive (AR) pretraining to implement ICL. Namely, the forward pass of the trained transformer is equivalent to optimizing an inner objective function in-context. However, whether the practical non-convex training dynamics will converge to the ideal mesa-optimizer is still unclear. Towards filling this gap, we investigate the non-convex dynamics of a one-layer linear causal self-attention model autoregressively trained by gradient flow, where the sequences are generated by an AR process $x_{t+1} = W x_t$. First, under a certain condition on the data distribution, we prove that an autoregressively trained transformer learns $W$ by implementing one step of gradient descent to minimize an ordinary least squares (OLS) problem in-context. It then applies the learned $\widehat{W}$ for next-token prediction, thereby verifying the mesa-optimization hypothesis. Next, under the same data condition, we explore the capability limitations of the obtained mesa-optimizer. We show that a stronger assumption related to the moments of the data is a necessary and sufficient condition for the learned mesa-optimizer to recover the data distribution. Moreover, we conduct exploratory analyses beyond the first data condition and prove that, in general, the trained transformer will not perform vanilla gradient descent for the OLS problem. Finally, simulation results verify our theoretical findings.
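To make the mesa-optimization mechanism concrete, the following minimal sketch (not the paper's code; the orthogonal choice of $W$, the context length, and the step size are illustrative assumptions) simulates the AR process $x_{t+1} = W x_t$, forms $\widehat{W}$ via a single gradient-descent step on the in-context OLS objective $\frac{1}{2}\sum_t \|x_{t+1} - W' x_t\|^2$ starting from $W' = 0$, and then uses $\widehat{W}$ for next-token prediction.

```python
import numpy as np

# Minimal sketch (illustrative, not the paper's implementation):
# simulate x_{t+1} = W x_t, then estimate W with ONE gradient-descent step
# on the in-context OLS objective L(W') = (1/2) * sum_t ||x_{t+1} - W' x_t||^2,
# which is the mesa-optimization behavior described above.
rng = np.random.default_rng(0)
d, T = 4, 64                                        # dimension and context length (assumed values)
W, _ = np.linalg.qr(rng.standard_normal((d, d)))    # orthogonal W keeps the sequence bounded (illustrative choice)

x = rng.standard_normal(d)
xs = [x]
for _ in range(T):
    x = W @ x
    xs.append(x)
X_prev = np.stack(xs[:-1], axis=1)                  # columns x_1, ..., x_T
X_next = np.stack(xs[1:], axis=1)                   # columns x_2, ..., x_{T+1}

# One GD step from W' = 0: the gradient at 0 is -X_next X_prev^T,
# so W_hat = eta * X_next X_prev^T.  The step size below roughly rescales the
# empirical second moment of the context to identity (an assumption, not the
# paper's learned step size).
eta = d / np.trace(X_prev @ X_prev.T)
W_hat = eta * (X_next @ X_prev.T)

# Next-token prediction with the one-step estimate \widehat{W}.
pred = W_hat @ xs[-1]
true_next = W @ xs[-1]
print("relative prediction error:",
      np.linalg.norm(pred - true_next) / np.linalg.norm(true_next))
```

The closer the empirical second moment of the context is to isotropic, the closer this one-step estimate is to $W$; this is the intuition behind the moment condition under which the mesa-optimizer recovers the distribution.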