In-context learning (ICL) of large language models has proven to be a surprisingly effective method for learning a new task from only a few demonstration examples. In this paper, we study the efficacy of ICL from the viewpoint of statistical learning theory. We develop approximation and generalization error bounds for a transformer composed of a deep neural network and one linear attention layer, pretrained on nonparametric regression tasks sampled from general function spaces including the Besov space and the piecewise $\gamma$-smooth class. We show that sufficiently trained transformers can achieve -- and even improve upon -- the minimax optimal estimation risk in context by encoding the most relevant basis representations during pretraining. Our analysis extends to high-dimensional or sequential data and distinguishes the \emph{pretraining} and \emph{in-context} generalization gaps. Furthermore, we establish information-theoretic lower bounds for meta-learners with respect to both the number of tasks and the number of in-context examples. These findings shed light on the roles of task diversity and representation learning for ICL.
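For concreteness, the following is a minimal sketch of the kind of in-context regression setting the abstract refers to; the notation here ($\Phi$ for the deep-network feature map, $\Gamma$ for the attention weights, $\xi_i$ for observation noise) is an illustrative assumption rather than the paper's exact formulation. Given a task $f$ drawn from a function class such as a Besov space, the model observes noisy in-context examples and predicts the response at a query point via a single linear attention read-out over learned features:
\begin{align*}
  y_i &= f(x_i) + \xi_i, \qquad i = 1, \dots, n, \\
  \hat{y}_{n+1} &= \Phi(x_{n+1})^\top \, \Gamma \left( \frac{1}{n} \sum_{i=1}^{n} \Phi(x_i)\, y_i \right),
\end{align*}
where $\Phi$ is produced by the pretrained deep network and $\Gamma$ is a learned matrix; pretraining across many sampled tasks $f$ is what allows $\Phi$ to encode the relevant basis representations mentioned above.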