The transformer's remarkable ability to perform in-context learning (ICL) has sparked a wide range of studies designed to understand its strengths and limitations. However, a theoretical understanding of when ICL can and cannot generalize beyond its pre-training data is still lacking. This paper puts forth a minimal mathematical model that provably identifies when ICL can generalize out-of-distribution (OOD). By studying linear regression tasks parameterized with low-rank covariance matrices, we model distribution shifts as varying angles between subspaces and derive conditions under which a single-layer linear attention model interpolates across all angles. We show that if pre-training task vectors are drawn from a union of subspaces, transformers can generalize to all angle shifts, enabling ICL even in regions with zero probability mass under the training distribution. On the other hand, if the pre-training tasks are drawn from a single Gaussian, the test risk shows a non-negligible dependence on the angle, implying that ICL cannot generalize OOD. We empirically show that our results also hold for models such as GPT-2, and present experiments on how our results extend to nonlinear function classes.
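To make the setup concrete, the following is a minimal NumPy sketch of one way to realize "distribution shift as an angle between subspaces". It is an illustration under stated assumptions, not the construction used in the paper: the two orthogonal training subspaces, the rotation formula, the isotropic Gaussian inputs, and all function names (`orthonormal_pair`, `rotated_basis`, `principal_angles`, `sample_tasks`, `sample_prompt`) are hypothetical choices made here.

```python
import numpy as np


def orthonormal_pair(d, r, rng):
    """Return two d x r orthonormal bases whose spans are mutually orthogonal (needs d >= 2r)."""
    q, _ = np.linalg.qr(rng.standard_normal((d, 2 * r)))
    return q[:, :r], q[:, r:]


def rotated_basis(u1, u2, theta):
    """Basis of a subspace at angle theta from span(u1), rotating toward span(u2).

    Assumes u1 and u2 have orthonormal columns and u1.T @ u2 == 0,
    so the result also has orthonormal columns.
    """
    return np.cos(theta) * u1 + np.sin(theta) * u2


def principal_angles(a, b):
    """Principal angles (radians) between the column spans of orthonormal a and b."""
    s = np.linalg.svd(a.T @ b, compute_uv=False)
    return np.arccos(np.clip(s, -1.0, 1.0))


def sample_tasks(bases, n, rng):
    """Draw n task vectors from a union of subspaces: pick a basis uniformly, then w = U z."""
    r = bases[0].shape[1]
    idx = rng.integers(len(bases), size=n)
    z = rng.standard_normal((n, r))
    return np.stack([bases[i] @ z[j] for j, i in enumerate(idx)])


def sample_prompt(w, n_ctx, rng):
    """Noiseless linear-regression context (x_i, y_i = <w, x_i>) with isotropic Gaussian inputs (an assumption)."""
    x = rng.standard_normal((n_ctx, w.shape[0]))
    return x, x @ w


rng = np.random.default_rng(0)
u1, u2 = orthonormal_pair(d=8, r=2, rng=rng)

# Pre-training tasks live in span(u1) or span(u2); the test subspace sits strictly between them.
train_w = sample_tasks([u1, u2], n=5, rng=rng)
u_test = rotated_basis(u1, u2, theta=np.pi / 3)
test_w = sample_tasks([u_test], n=1, rng=rng)[0]
x_ctx, y_ctx = sample_prompt(test_w, n_ctx=4, rng=rng)
```

In this sketch, any angle strictly between 0 and π/2 gives a test subspace that has zero probability mass under the union-of-subspaces training distribution, which is the regime the abstract refers to. A single-Gaussian baseline would instead draw every training task from one fixed basis.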