In this paper, we conduct a comprehensive study of in-context learning (ICL) by addressing several open questions: (a) What type of ICL estimator is learned within language models? (b) What are suitable performance metrics for evaluating ICL accurately, and what are the error rates? (c) How does the transformer architecture enable ICL? To answer (a), we take a Bayesian view and demonstrate that ICL implicitly implements the Bayesian model averaging algorithm, which we prove is approximately parameterized by the attention mechanism. For (b), we analyze ICL performance from an online learning perspective and establish a regret bound of $\mathcal{O}(1/T)$, where $T$ is the ICL input sequence length. To address (c), in addition to the Bayesian model averaging algorithm encoded in attention, we show that during pretraining, the total variation distance between the learned model and the nominal model is bounded by the sum of an approximation error and a generalization error of $\tilde{\mathcal{O}}(1/\sqrt{N_{\mathrm{p}}T_{\mathrm{p}}})$, where $N_{\mathrm{p}}$ and $T_{\mathrm{p}}$ are the number of token sequences and the length of each sequence in pretraining, respectively. Our results provide a unified understanding of the transformer and its ICL ability, with bounds on ICL regret, approximation, and generalization, deepening our knowledge of these essential aspects of modern language models.
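To make the Bayesian model averaging view and the $\mathcal{O}(1/T)$ regret rate concrete, the following is a minimal toy sketch and not the construction analyzed in the paper. It assumes a finite set of Bernoulli hypotheses, a uniform prior, and log-loss as the performance metric. The function name `bma_average_regret` and all parameter values are hypothetical and chosen only for illustration. In this simplified setting, the cumulative log-loss of the Bayesian mixture exceeds that of the best single hypothesis by at most $\log K$, so the average regret is at most $\log(K)/T$.

```python
import math
import random


def bma_average_regret(thetas, true_theta, T, seed=0):
    """Toy Bayesian model averaging over K Bernoulli hypotheses.

    Assumptions (illustrative only): uniform prior, log-loss, and every
    theta strictly inside (0, 1) so that no likelihood is zero.
    Returns the average log-loss regret against the best hypothesis in hindsight.
    """
    rng = random.Random(seed)
    K = len(thetas)
    log_post = [-math.log(K)] * K  # log of the uniform prior
    bma_loss = 0.0                 # cumulative log-loss of the BMA predictor
    hyp_loss = [0.0] * K           # cumulative log-loss of each hypothesis

    for _ in range(T):
        x = 1 if rng.random() < true_theta else 0

        # Normalize the posterior with a max-shift for numerical stability.
        m = max(log_post)
        weights = [math.exp(lp - m) for lp in log_post]
        z = sum(weights)
        weights = [w / z for w in weights]

        # BMA prediction: posterior-weighted average of the hypotheses.
        p_one = sum(w * th for w, th in zip(weights, thetas))
        bma_loss += -math.log(p_one if x == 1 else 1.0 - p_one)

        # Score each hypothesis and update its posterior with the new observation.
        for k, th in enumerate(thetas):
            log_lik = math.log(th if x == 1 else 1.0 - th)
            hyp_loss[k] -= log_lik
            log_post[k] += log_lik

    return (bma_loss - min(hyp_loss)) / T


if __name__ == "__main__":
    hypotheses = [0.1, 0.3, 0.5, 0.7, 0.9]
    for T in (10, 100, 1000):
        regret = bma_average_regret(hypotheses, true_theta=0.7, T=T)
        print(T, regret, math.log(len(hypotheses)) / T)
```

The printed pairs compare the observed average regret with the $\log(K)/T$ bound for increasing sequence lengths. This only illustrates why a $1/T$ rate is natural for a Bayesian mixture predictor. It says nothing about how attention parameterizes the averaging or about the pretraining error terms.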