Trained Transformers Learn Linear Models In-Context

Attention-based neural networks such as transformers have demonstrated a remarkable ability to exhibit in-context learning (ICL): given a short prompt sequence of tokens from an unseen task, they can formulate relevant per-token and next-token predictions without any parameter updates. Embedding a sequence of labeled training data and unlabeled test data as a prompt allows transformers to behave like supervised learning algorithms. Indeed, recent work has shown that when transformer architectures are trained over random instances of linear regression problems, their predictions mimic those of ordinary least squares. Towards understanding the mechanisms underlying this phenomenon, we investigate the dynamics of ICL in transformers with a single linear self-attention layer trained by gradient flow on linear regression tasks. We show that despite non-convexity, gradient flow with a suitable random initialization finds a global minimum of the objective function. At this global minimum, when given a test prompt of labeled examples from a new prediction task, the transformer achieves prediction error competitive with the best linear predictor over the test prompt distribution. We additionally characterize the robustness of the trained transformer to a variety of distribution shifts and show that although a number of shifts are tolerated, shifts in the covariate distribution of the prompts are not. Motivated by this, we consider a generalized ICL setting where the covariate distributions can vary across prompts. We show that although gradient flow succeeds at finding a global minimum in this setting, the trained transformer is still brittle under mild covariate shifts. We complement this finding with experiments on large, nonlinear transformer architectures, which we show are more robust under covariate shifts.
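To make the setup concrete, the following is a minimal NumPy sketch of a single linear self-attention layer (attention without softmax) applied to a linear regression prompt. The prompt layout, the function names (`build_prompt`, `lsa_predict`, `hand_set_weights`), and in particular the hand-set weight matrices are illustrative assumptions; they are not the parameters that gradient flow converges to. With these weights the layer's prediction for the query reduces to x_query^T (1/N) sum_i y_i x_i, which is close to the ordinary least squares prediction only when the covariates are roughly isotropic with unit variance.

```python
import numpy as np


def build_prompt(X, y, x_query):
    """Stack N labeled examples and one unlabeled query into a (d+1) x (N+1) matrix.

    Columns 0..N-1 hold (x_i, y_i); the last column holds (x_query, 0).
    This layout is an assumption made for illustration.
    """
    top = np.concatenate([X.T, x_query[:, None]], axis=1)  # d x (N+1)
    bottom = np.concatenate([y, [0.0]])[None, :]           # 1 x (N+1)
    return np.vstack([top, bottom])


def lsa_predict(E, W_pv, W_kq):
    """Apply one linear self-attention layer with a residual connection.

    The prediction is read from the label slot of the query column.
    """
    n = E.shape[1] - 1  # number of labeled examples
    out = E + W_pv @ E @ (E.T @ W_kq @ E) / n
    return out[-1, -1]


def hand_set_weights(d):
    """Hypothetical weights chosen by hand (not learned).

    W_kq compares covariates only; W_pv copies only the label row.
    """
    W_pv = np.zeros((d + 1, d + 1))
    W_pv[-1, -1] = 1.0
    W_kq = np.zeros((d + 1, d + 1))
    W_kq[:d, :d] = np.eye(d)
    return W_pv, W_kq


# Small demonstration on one noise-free regression task with isotropic covariates.
rng = np.random.default_rng(0)
d, N = 5, 2000
w = rng.normal(size=d)            # task-specific weight vector
X = rng.normal(size=(N, d))       # covariates x_i ~ N(0, I)
y = X @ w                         # labels y_i = <w, x_i>
x_query = rng.normal(size=d)

E = build_prompt(X, y, x_query)
W_pv, W_kq = hand_set_weights(d)
pred_lsa = lsa_predict(E, W_pv, W_kq)

# Ordinary least squares fitted on the same in-context examples.
w_ols = np.linalg.lstsq(X, y, rcond=None)[0]
pred_ols = x_query @ w_ols

print("linear self-attention prediction:", pred_lsa)
print("ordinary least squares prediction:", pred_ols)
```

If the rows of `X` are instead drawn with a non-identity covariance, the hand-set layer's prediction drifts away from the least squares one, which gives a rough intuition for why covariate shift is the delicate case in this setting.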
