We study the \emph{in-context learning} (ICL) ability of a \emph{Linear Transformer Block} (LTB) that combines a linear attention component with a linear multi-layer perceptron (MLP) component. For ICL of linear regression with a Gaussian prior that has a \emph{non-zero mean}, we show that LTB can achieve nearly Bayes-optimal ICL risk. In contrast, any estimator that uses only linear attention must incur an irreducible additive approximation error. Furthermore, we establish a correspondence between LTB and one-step gradient descent estimators with learnable initialization ($\mathsf{GD}\text{-}\mathbf{\beta}$): every $\mathsf{GD}\text{-}\mathbf{\beta}$ estimator can be implemented by an LTB estimator, and every optimal LTB estimator, i.e., one that minimizes the in-class ICL risk, is effectively a $\mathsf{GD}\text{-}\mathbf{\beta}$ estimator. Finally, we show that $\mathsf{GD}\text{-}\mathbf{\beta}$ estimators can be efficiently optimized with gradient flow, despite a non-convex training objective. Our results reveal that LTB achieves ICL by implementing $\mathsf{GD}\text{-}\mathbf{\beta}$, and they highlight the role of MLP layers in reducing approximation error.
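
As a rough illustration of what a one-step gradient descent estimator with learnable initialization might look like, consider the following sketch. The notation is an assumption made for this example and is not fixed by the text above: the context consists of $n$ pairs stacked into $\mathbf{X} \in \mathbb{R}^{n \times d}$ and $\mathbf{y} \in \mathbb{R}^{n}$, the query is $\mathbf{x} \in \mathbb{R}^{d}$, the learnable initialization is $\mathbf{\beta} \in \mathbb{R}^{d}$, and $\mathbf{\Gamma} \in \mathbb{R}^{d \times d}$ is a hypothetical learnable matrix step size. Under these assumptions, one gradient step on the least-squares loss $\frac{1}{2n}\|\mathbf{X}\mathbf{w} - \mathbf{y}\|_2^2$ starting from $\mathbf{w} = \mathbf{\beta}$ yields the prediction
\[
  \hat{y}(\mathbf{x})
  \;=\;
  \Big\langle\,
    \mathbf{\beta} \;-\; \mathbf{\Gamma}\,\tfrac{1}{n}\,\mathbf{X}^{\top}\big(\mathbf{X}\mathbf{\beta} - \mathbf{y}\big),\;
    \mathbf{x}
  \,\Big\rangle .
\]
Setting $\mathbf{\beta} = \mathbf{0}$ in this sketch reduces it to one gradient step from zero initialization, $\hat{y}(\mathbf{x}) = \big\langle \mathbf{\Gamma}\,\tfrac{1}{n}\,\mathbf{X}^{\top}\mathbf{y},\, \mathbf{x} \big\rangle$. One plausible reading is that a non-zero $\mathbf{\beta}$ lets the estimator start from a point that reflects the non-zero prior mean.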