Transformers as Statisticians: Provable In-Context Learning with In-Context Algorithm Selection

Neural sequence models based on the transformer architecture have demonstrated remarkable \emph{in-context learning} (ICL) abilities: they can perform new tasks when prompted with training and test examples, without any update to the model parameters. This work first provides a comprehensive statistical theory of how transformers perform ICL. Concretely, we show that transformers can implement a broad class of standard machine learning algorithms in context, such as least squares, ridge regression, Lasso, learning generalized linear models, and gradient descent on two-layer neural networks, with near-optimal predictive power on various in-context data distributions. Using an efficient implementation of in-context gradient descent as the underlying mechanism, our transformer constructions admit mild size bounds and can be learned with polynomially many pretraining sequences. Building on these ``base'' ICL algorithms, we show that transformers can implement more complex ICL procedures involving \emph{in-context algorithm selection}, akin to what a statistician can do in real life: a \emph{single} transformer can adaptively select different base ICL algorithms, or even perform qualitatively different tasks, on different input sequences, without any explicit prompting of the right algorithm or task. We establish this in theory through explicit constructions and also observe it experimentally. In theory, we construct two general mechanisms for algorithm selection, each with concrete examples: pre-ICL testing and post-ICL validation. As an example, we use the post-ICL validation mechanism to construct a transformer that can perform nearly Bayes-optimal ICL on a challenging task: noisy linear models with mixed noise levels. Experimentally, we demonstrate the strong in-context algorithm selection capabilities of standard transformer architectures.
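To make the idea of in-context algorithm selection concrete, the following is a minimal sketch of the statistical procedure that post-ICL validation appears to mirror, assuming it amounts to a train/validation split over the in-context examples. It is not the transformer construction itself. Each base algorithm here is ridge regression fit by plain gradient descent, and the candidate with the lowest validation error is selected. All function names, step sizes, and the candidate regularization values are illustrative assumptions, not taken from the paper.

```python
import random


def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


def ridge_gd(xs, ys, lam, steps=1000, lr=0.05):
    """Gradient descent on (1/2n) * sum (w.x - y)^2 + (lam/2) * ||w||^2.

    The step size and step count are illustrative assumptions that
    suit roughly standardized features.
    """
    n, d = len(xs), len(xs[0])
    w = [0.0] * d
    for _ in range(steps):
        grad = [lam * wj for wj in w]  # gradient of the ridge penalty
        for x, y in zip(xs, ys):
            residual = dot(w, x) - y
            for j in range(d):
                grad[j] += residual * x[j] / n
        w = [wj - lr * gj for wj, gj in zip(w, grad)]
    return w


def mse(w, xs, ys):
    """Mean squared prediction error of the linear predictor w."""
    return sum((dot(w, x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def select_by_validation(xs, ys, lams, train_frac=0.5):
    """Fit one ridge predictor per lam on the first part of the sequence,
    then return the (lam, w) pair with the lowest error on the rest."""
    n_train = int(len(xs) * train_frac)
    train_x, train_y = xs[:n_train], ys[:n_train]
    val_x, val_y = xs[n_train:], ys[n_train:]
    candidates = [(lam, ridge_gd(train_x, train_y, lam)) for lam in lams]
    return min(candidates, key=lambda c: mse(c[1], val_x, val_y))


def make_sequence(rng, n, w_star, sigma):
    """Sample n in-context examples y = w_star.x + sigma * noise."""
    d = len(w_star)
    xs = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]
    ys = [dot(w_star, x) + sigma * rng.gauss(0.0, 1.0) for x in xs]
    return xs, ys


if __name__ == "__main__":
    rng = random.Random(0)
    w_star = [1.0, -2.0, 0.5]
    for sigma in (0.0, 2.0):  # two noise levels, one selector
        xs, ys = make_sequence(rng, 40, w_star, sigma)
        lam, w = select_by_validation(xs, ys, lams=[0.0, 0.1, 1.0, 10.0])
        print(f"sigma={sigma}: selected lam={lam}")
```

The same `select_by_validation` routine is applied to every sequence, and the choice of regularization strength is driven only by the data in that sequence. This parallels the claim that a single transformer can adapt to mixed noise levels without being told which one applies.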
