This paper develops a finite-sample statistical theory for in-context learning (ICL), analyzed within a meta-learning framework that accommodates mixtures of diverse task types. We introduce a principled risk decomposition that separates the total ICL risk into two orthogonal components: the Bayes Gap and the Posterior Variance. The Bayes Gap quantifies how well the trained model approximates the Bayes-optimal in-context predictor. For a uniform-attention Transformer, we derive a non-asymptotic upper bound on this gap that makes explicit its dependence on the number of pretraining prompts and their context length. The Posterior Variance is a model-independent risk that represents the intrinsic task uncertainty. Our key finding is that this term is determined solely by the difficulty of the true underlying task, while the uncertainty arising from the task mixture decays exponentially fast and becomes negligible after only a few in-context examples. Together, these results provide a unified view of ICL: the Transformer selects the optimal meta-algorithm during pretraining and rapidly converges to the optimal algorithm for the true task at test time.
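The claim that mixture uncertainty vanishes quickly can be illustrated with a toy simulation. The sketch below is not the paper's setting: it assumes a hypothetical two-task mixture of one-dimensional linear regressions (slopes +1 and -1, uniform prior, Gaussian noise with standard deviation `sigma`), and all function and variable names are chosen here for illustration. It computes the exact Bayes posterior over the task index from the in-context examples, then measures the part of the predictive variance at a query point that comes from not knowing which task is active. The remaining part of the posterior variance in this toy model is the noise level `sigma**2`, which plays the role of the difficulty of the true task.

```python
import numpy as np

def task_posterior(xs, ys, weights, prior, sigma):
    """Exact posterior over the task index given context (xs, ys)."""
    # Residuals of the context under each candidate task: shape (num_tasks, n).
    resid = ys[None, :] - weights[:, None] * xs[None, :]
    log_post = np.log(prior) - 0.5 * np.sum(resid**2, axis=1) / sigma**2
    log_post -= log_post.max()  # stabilise before exponentiating
    post = np.exp(log_post)
    return post / post.sum()

def mixture_variance(post, weights, x_query):
    """Variance of the noiseless prediction caused by task-index uncertainty."""
    preds = weights * x_query
    mean = np.dot(post, preds)
    return np.dot(post, (preds - mean) ** 2)

rng = np.random.default_rng(0)
weights = np.array([1.0, -1.0])   # assumed task slopes (toy choice)
prior = np.array([0.5, 0.5])      # assumed uniform task prior
sigma = 0.5                       # assumed observation noise level
true_task = 0
n_trials = 2000

avg_mixture_var = {}
for n in [0, 1, 2, 4, 8, 16]:
    total = 0.0
    for _ in range(n_trials):
        xs = rng.normal(size=n)
        ys = weights[true_task] * xs + sigma * rng.normal(size=n)
        post = task_posterior(xs, ys, weights, prior, sigma)
        total += mixture_variance(post, weights, x_query=1.0)
    avg_mixture_var[n] = total / n_trials

for n, v in avg_mixture_var.items():
    print(f"context length {n:2d}: average mixture variance {v:.6f}")
```

With no context the mixture variance equals the prior variance of the prediction, and it shrinks rapidly as the context grows, while the noise term `sigma**2` is unaffected by the number of in-context examples.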