What and How does In-Context Learning Learn? Bayesian Model Averaging, Parameterization, and Generalization

In this paper, we conduct a comprehensive study of In-Context Learning (ICL) by addressing several open questions: (a) What type of ICL estimator is learned by large language models? (b) What is a proper performance metric for ICL and what is the error rate? (c) How does the transformer architecture enable ICL? To answer these questions, we adopt a Bayesian view and formulate ICL as a problem of predicting the response corresponding to the current covariate, given a number of examples drawn from a latent variable model. To answer (a), we show that, without updating the neural network parameters, ICL implicitly implements the Bayesian model averaging algorithm, which is proven to be approximately parameterized by the attention mechanism. For (b), we analyze the ICL performance from an online learning perspective and establish an $\mathcal{O}(1/T)$ regret bound for perfectly pretrained ICL, where $T$ is the number of examples in the prompt. To answer (c), we show that, in addition to encoding Bayesian model averaging via attention, the transformer architecture also enables a fine-grained statistical analysis of pretraining under realistic assumptions. In particular, we prove that the error of the pretrained model is bounded by the sum of an approximation error and a generalization error, where the former decays to zero exponentially as the depth grows, and the latter decays to zero sublinearly with the number of tokens in the pretraining dataset. Our results provide a unified understanding of the transformer and its ICL ability with bounds on ICL regret, approximation, and generalization, which deepens our knowledge of these essential aspects of modern language models.
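
As an informal illustration of the Bayesian model averaging view and of regret that shrinks like $1/T$, the following minimal sketch simulates a toy setting that is not taken from the paper: the latent variable is assumed to be one of a small finite set of regression slopes, responses are assumed Gaussian, and regret is measured in log-loss against the best fixed hypothesis in hindsight. All names (`bma_cumulative_regret`, `simulate`, the slope grid, the noise level) are hypothetical. In this finite-hypothesis setting with a uniform prior, the cumulative regret of the Bayesian mixture is at most $\log K$ for $K$ hypotheses, so the average regret is at most $\log(K)/T$.

```python
import math
import random


def gaussian_logpdf(y, mean, sigma):
    """Log-density of N(mean, sigma^2) evaluated at y."""
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (y - mean) ** 2 / (2 * sigma ** 2)


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) for a list of floats."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def bma_cumulative_regret(xs, ys, slopes, sigma=1.0):
    """Cumulative log-loss regret of Bayesian model averaging after each example.

    Assumption (toy model, not the paper's construction): y = w * x + noise,
    with w drawn from the finite set `slopes` under a uniform prior.
    """
    k = len(slopes)
    log_post = [-math.log(k)] * k      # uniform prior over hypotheses
    cum_loss_bma = 0.0                 # cumulative log-loss of the BMA predictor
    cum_loss_each = [0.0] * k          # cumulative log-loss of each fixed hypothesis
    regrets = []
    for x, y in zip(xs, ys):
        log_liks = [gaussian_logpdf(y, w * x, sigma) for w in slopes]
        # BMA predictive density: posterior-weighted mixture of hypothesis densities.
        log_pred = logsumexp([lp + ll for lp, ll in zip(log_post, log_liks)])
        cum_loss_bma -= log_pred
        cum_loss_each = [c - ll for c, ll in zip(cum_loss_each, log_liks)]
        # Bayes update of the posterior; no model parameters are changed.
        log_post = [lp + ll - log_pred for lp, ll in zip(log_post, log_liks)]
        regrets.append(cum_loss_bma - min(cum_loss_each))
    return regrets


def simulate(num_examples=200, seed=0):
    """Draw a synthetic prompt of (x, y) examples and run BMA over it."""
    rng = random.Random(seed)
    slopes = [-1.0, -0.5, 0.0, 0.5, 1.0]   # hypothetical latent-variable grid
    true_slope = 0.5
    xs = [rng.uniform(-1.0, 1.0) for _ in range(num_examples)]
    ys = [true_slope * x + rng.gauss(0.0, 1.0) for x in xs]
    return slopes, bma_cumulative_regret(xs, ys, slopes)


slopes, regrets = simulate()
for t in (10, 50, 200):
    # Average regret after t examples; bounded by log(K) / t in this toy setting.
    print(t, regrets[t - 1] / t, math.log(len(slopes)) / t)
```

The posterior update inside the loop plays the role of "learning" from the prompt, while the hypothesis set stays fixed throughout, which mirrors the abstract's point that ICL proceeds without parameter updates. The sketch says nothing about how attention parameterizes this computation.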
