Recent empirical work on in-context learning has shown that transformers trained on synthetic linear regression tasks can, given sufficient capacity, learn to implement ridge regression, which is the Bayes-optimal predictor [Akyürek et al., 2023], while one-layer transformers with linear self-attention and no MLP layer learn to implement one step of gradient descent (GD) on a least-squares linear regression objective [von Oswald et al., 2022]. However, the theory behind these observations remains poorly understood. We theoretically study transformers with a single layer of linear self-attention, trained on synthetic noisy linear regression data. First, we show mathematically that when the covariates are drawn from a standard Gaussian distribution, the one-layer transformer that minimizes the pre-training loss implements a single step of GD on the least-squares linear regression objective. Next, we find that changing the distribution of the covariates and the weight vector to a non-isotropic Gaussian distribution strongly affects the learned algorithm: the global minimizer of the pre-training loss now implements a single step of $\textit{pre-conditioned}$ GD. In contrast, changing only the distribution of the responses has little effect on the learned algorithm: even when the response comes from a more general family of $\textit{nonlinear}$ functions, the global minimizer of the pre-training loss still implements a single step of GD on a least-squares linear regression objective.
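To make the central claim concrete, the following minimal sketch checks numerically that a single linear self-attention readout can reproduce one step of GD from a zero initialization on the least-squares objective. Everything in it is an illustrative assumption rather than the exact parametrization analyzed here: the token layout `(x_i, y_i)` with a query token `(x_query, 0)`, the hand-picked attention weights (keys and queries read the covariate coordinates scaled by a step size `eta`, values read the label coordinate), the unnormalized objective `0.5 * sum_i (w.x_i - y_i)^2`, and the helper names `one_step_gd_prediction` and `linear_attention_prediction`.

```python
import random


def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


def one_step_gd_prediction(xs, ys, x_query, eta):
    """Predict with one GD step from w = 0 on 0.5 * sum_i (w.x_i - y_i)^2."""
    d = len(x_query)
    w = [0.0] * d
    grad = [0.0] * d
    for x, y in zip(xs, ys):
        residual = dot(w, x) - y
        for k in range(d):
            grad[k] += residual * x[k]
    w = [wk - eta * gk for wk, gk in zip(w, grad)]
    return dot(w, x_query)


def linear_attention_prediction(xs, ys, x_query, eta):
    """Label slot of the query token after one linear self-attention layer.

    Assumed weights (hand-picked, not learned): the key-query product acts as
    eta * identity on covariate coordinates, and the value map keeps only the
    label coordinate. There is no softmax.
    """
    d = len(x_query)
    tokens = [list(x) + [y] for x, y in zip(xs, ys)]
    tokens.append(list(x_query) + [0.0])  # query token carries a zero label
    query = tokens[-1]
    pred = query[d]  # residual connection; equals 0 for the query token
    for tok in tokens:
        score = eta * dot(tok[:d], query[:d])  # linear attention score
        value = tok[d]  # label coordinate of the attended token
        pred += value * score
    return pred


# Synthetic noisy linear regression prompt with standard Gaussian covariates.
random.seed(0)
d, n, eta = 3, 8, 0.1
w_true = [random.gauss(0, 1) for _ in range(d)]
xs = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]
ys = [dot(w_true, x) + 0.1 * random.gauss(0, 1) for x in xs]
x_query = [random.gauss(0, 1) for _ in range(d)]

gd_pred = one_step_gd_prediction(xs, ys, x_query, eta)
attn_pred = linear_attention_prediction(xs, ys, x_query, eta)
print("one-step GD prediction:     ", gd_pred)
print("linear attention prediction:", attn_pred)
print("agree:", abs(gd_pred - attn_pred) < 1e-9)
```

Both functions compute `eta * sum_i y_i * <x_i, x_query>`, so they agree up to floating-point rounding. Under the same assumptions, replacing the scaled identity in the attention score with a general matrix would correspond to a pre-conditioned step.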