The transformer's emergent ability to perform in-context learning (ICL) has sparked a wide range of studies designed to understand its underlying mechanisms. Existing works often study how training task diversity, defined either as the number of ICL training task vectors or as the number of function classes from which the task vectors are drawn, shapes both the learning dynamics and generalization capabilities of ICL. While both definitions have uncovered many interesting phenomena, many observations under the latter definition remain theoretically unexplained. This paper presents a minimal analytical model under which these phenomena provably emerge from the properties of the training data. By modeling the training task vectors as a mixture of low-rank Gaussians, we show how training task diversity, defined by the number of non-overlapping columns between subspaces that parameterize the covariance matrices, improves both the generalization and optimization trajectory of ICL with linear attention. In particular, we show that our model can explain (i) why training with task diversity shortens the ICL plateau and (ii) why ICL appears to achieve out-of-distribution generalization. We conclude by empirically demonstrating how our results extend to nonlinear transformers and nonlinear function classes. Overall, our work presents a tractable framework to unify existing observations.
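To make the data model concrete, the sketch below samples task vectors from a mixture of low-rank Gaussians and builds a linear-regression ICL prompt from one of them. It is an illustration under stated assumptions, not the paper's construction: the function names (`make_subspaces`, `sample_task_vectors`, `sample_prompt`), the sliding-window way of sharing columns between subspaces, the uniform mixture weights, and the noiseless labels `y = x @ w` are all hypothetical choices made here for clarity.

```python
import numpy as np


def make_subspaces(d, r, k, overlap, rng):
    """Build k column-orthonormal (d, r) matrices from one shared basis.

    Assumption: consecutive subspaces share exactly `overlap` columns, so
    each pair of neighbours has r - overlap non-overlapping columns.
    """
    if not 0 <= overlap <= r:
        raise ValueError("overlap must lie in [0, r]")
    step = r - overlap
    if step * (k - 1) + r > d:
        raise ValueError("not enough dimensions for the requested subspaces")
    # Random orthonormal basis of R^d obtained from a QR factorization.
    basis, _ = np.linalg.qr(rng.standard_normal((d, d)))
    return [basis[:, i * step : i * step + r] for i in range(k)]


def sample_task_vectors(subspaces, n, rng):
    """Draw n task vectors w = U_c z with z ~ N(0, I_r).

    Each mixture component c therefore has the rank-r covariance U_c U_c^T.
    Components are picked uniformly at random (an assumption).
    """
    r = subspaces[0].shape[1]
    comps = rng.integers(len(subspaces), size=n)
    z = rng.standard_normal((n, r))
    w = np.stack([subspaces[c] @ z[i] for i, c in enumerate(comps)])
    return w, comps


def sample_prompt(w, n_ctx, rng):
    """Build one noiseless linear-regression prompt (x_i, y_i = <w, x_i>)."""
    x = rng.standard_normal((n_ctx, w.shape[0]))
    return x, x @ w
```

In this sketch, lowering `overlap` increases the number of non-overlapping columns between neighbouring subspaces, which is one way to vary task diversity in the sense described above while keeping the rank `r` of every component fixed.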