Neural sequence models based on the transformer architecture have demonstrated remarkable \emph{in-context learning} (ICL) abilities: they can perform new tasks when prompted with training and test examples, without any parameter update to the model. This work first provides a comprehensive statistical theory for transformers to perform ICL. Concretely, we show that transformers can implement a broad class of standard machine learning algorithms in context, such as least squares, ridge regression, Lasso, learning generalized linear models, and gradient descent on two-layer neural networks, with near-optimal predictive power on various in-context data distributions. Using an efficient implementation of in-context gradient descent as the underlying mechanism, our transformer constructions admit mild size bounds and can be learned with polynomially many pretraining sequences. Building on these ``base'' ICL algorithms, we show that, intriguingly, transformers can implement more complex ICL procedures involving \emph{in-context algorithm selection}, akin to what a statistician can do in real life: a \emph{single} transformer can adaptively select different base ICL algorithms, or even perform qualitatively different tasks, on different input sequences, without any explicit prompting of the right algorithm or task. We establish this in theory by explicit constructions and also observe it experimentally. In theory, we construct two general mechanisms for algorithm selection, with concrete examples: pre-ICL testing and post-ICL validation. As an example, we use the post-ICL validation mechanism to construct a transformer that can perform nearly Bayes-optimal ICL on a challenging task: noisy linear models with mixed noise levels. Experimentally, we demonstrate the strong in-context algorithm selection capabilities of standard transformer architectures.
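To make the post-ICL validation idea more concrete, the sketch below mimics it in plain Python: several base learners are fit by gradient descent on one part of the in-context examples, and the one with the lowest error on the held-out part is selected. This is not the paper's transformer construction. The ridge candidates, the 50/50 split, the step size, and all names (`gd_ridge`, `post_icl_validation`, `make_sequence`) are assumptions made for illustration.

```python
import random

def predict(w, x):
    """Linear prediction <w, x>."""
    return sum(wj * xj for wj, xj in zip(w, x))

def mse(w, xs, ys):
    """Mean squared error of w on the examples (xs, ys)."""
    return sum((predict(w, x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def gd_ridge(xs, ys, lam, steps=1000, lr=0.05):
    """Gradient descent on (1/2n) * sum (<w,x> - y)^2 + (lam/2) * ||w||^2.

    steps and lr are illustrative defaults, not values from the paper.
    """
    n, d = len(xs), len(xs[0])
    w = [0.0] * d
    for _ in range(steps):
        grad = [lam * wj for wj in w]          # gradient of the ridge penalty
        for x, y in zip(xs, ys):
            r = predict(w, x) - y              # residual on one example
            for j in range(d):
                grad[j] += r * x[j] / n        # gradient of the squared loss
        w = [wj - lr * gj for wj, gj in zip(w, grad)]
    return w

def post_icl_validation(xs, ys, lams, train_frac=0.5):
    """Fit one ridge solution per lam on the first part of the sequence,
    then return the (lam, w) pair with the lowest held-out error."""
    n_train = int(len(xs) * train_frac)
    train_x, train_y = xs[:n_train], ys[:n_train]
    val_x, val_y = xs[n_train:], ys[n_train:]
    fits = [(lam, gd_ridge(train_x, train_y, lam)) for lam in lams]
    return min(fits, key=lambda f: mse(f[1], val_x, val_y))

def make_sequence(rng, w_true, n, sigma):
    """Hypothetical data generator: y = <w_true, x> + sigma * noise."""
    xs = [[rng.gauss(0, 1) for _ in w_true] for _ in range(n)]
    ys = [predict(w_true, x) + sigma * rng.gauss(0, 1) for x in xs]
    return xs, ys

# Usage: one noisy in-context sequence, three candidate regularization levels.
rng = random.Random(0)
xs, ys = make_sequence(rng, [1.0, -2.0, 0.5], n=40, sigma=0.5)
lam, w_hat = post_icl_validation(xs, ys, lams=[0.0, 0.1, 1.0])
print("selected lam:", lam, "prediction at query:", predict(w_hat, [1.0, 1.0, 1.0]))
```

Because the selection depends only on the sequence at hand, different input sequences (for example, ones generated with different noise levels) can lead to different selected candidates, which is the behavior the abstract attributes to a single transformer.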