Transformers have a remarkable ability to learn and execute tasks based on examples provided within the input itself, without explicit prior training. It has been argued that this capability, known as in-context learning (ICL), is a cornerstone of Transformers' success, yet questions about the necessary sample complexity, pretraining task diversity, and context length for successful ICL remain unresolved. Here, we provide precise answers to these questions in an exactly solvable model of ICL of a linear regression task by linear attention. We derive sharp asymptotics for the learning curve in a phenomenologically rich scaling regime where the token dimension is taken to infinity; the context length and pretraining task diversity scale proportionally with the token dimension; and the number of pretraining examples scales quadratically with it. We demonstrate a double-descent learning curve with increasing pretraining examples, and uncover a phase transition in the model's behavior between low and high task diversity regimes: In the low diversity regime, the model tends toward memorization of training tasks, whereas in the high diversity regime, it achieves genuine in-context learning and generalization beyond the scope of pretrained tasks. These theoretical insights are empirically validated through experiments with both linear attention and full nonlinear Transformer architectures.
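To make the setup concrete, the following is a minimal sketch of the task described above: a linear-attention-style estimator performing in-context linear regression, where prediction error improves as the context length grows relative to the token dimension. The estimator `y_hat = x_q · (1/ell) Σ_i y_i x_i` corresponds to a single linear attention head with identity key/value weights; this simplified parameterization, and all variable names, are illustrative assumptions rather than the paper's exact model.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64  # token dimension

def icl_error(ell, trials=200):
    """Mean squared prediction error of a simplified linear-attention
    in-context estimator on random linear regression tasks.

    Each trial draws a fresh task vector w, a context of `ell`
    (input, label) pairs, and a query input; the readout averages
    y_i * x_i over the context (an assumed identity-weight head).
    """
    errs = []
    for _ in range(trials):
        w = rng.normal(size=d) / np.sqrt(d)   # task vector, ||w|| ~ 1
        X = rng.normal(size=(ell, d))         # context inputs
        y = X @ w                             # noiseless context labels
        x_q = rng.normal(size=d)              # query input
        y_hat = x_q @ (X.T @ y) / ell         # linear-attention readout
        errs.append((y_hat - x_q @ w) ** 2)
    return float(np.mean(errs))

# Error shrinks as the context length grows past the token dimension.
err_short = icl_error(d // 2)
err_long = icl_error(8 * d)
```

Running this shows `err_long` well below `err_short`, consistent with the abstract's picture that context length must scale with the token dimension for the in-context estimate to sharpen.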