Transformers have a remarkable ability to learn and execute tasks based on examples provided within the input itself, without explicit prior training. It has been argued that this capability, known as in-context learning (ICL), is a cornerstone of Transformers' success, yet questions about the sample complexity, pretraining task diversity, and context length required for successful ICL remain unresolved. Here, we provide a precise answer to these questions in an exactly solvable model of ICL of a linear regression task by linear attention. We derive sharp asymptotics for the learning curve in a phenomenologically rich scaling regime where the token dimension is taken to infinity, the context length and pretraining task diversity scale proportionally with the token dimension, and the number of pretraining examples scales quadratically. We demonstrate a double-descent learning curve with increasing pretraining examples, and uncover a phase transition in the model's behavior between low and high task-diversity regimes: in the low-diversity regime, the model tends toward memorization of training tasks, whereas in the high-diversity regime, it achieves genuine in-context learning and generalization beyond the scope of pretrained tasks. These theoretical insights are empirically validated through experiments with both linear attention and full nonlinear Transformer architectures.
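To make the setup concrete, the following is a minimal sketch of in-context linear regression with a simplified linear-attention readout, together with the proportional scaling of context length, task diversity, and pretraining set size described above. It is an illustration under stated assumptions, not the paper's implementation: the function names, the ratio symbols `alpha`, `kappa`, and `tau`, the isotropic Gaussian data, the noise level, and the untrained choice `Gamma = d * I` are all hypothetical.

```python
import numpy as np


def proportional_scaling(d, alpha, kappa, tau):
    """Return (context length, task diversity, number of pretraining examples).

    Assumed ratios (hypothetical symbols): context length = alpha * d,
    task diversity = kappa * d, pretraining examples = tau * d**2.
    """
    return int(alpha * d), int(kappa * d), int(tau * d * d)


def sample_icl_prompt(rng, d, ell, noise_std=0.1):
    """Sample one in-context regression prompt with a fresh task vector."""
    w = rng.standard_normal(d)                       # task vector (assumed Gaussian)
    X = rng.standard_normal((ell, d)) / np.sqrt(d)   # context inputs, covariance I/d
    y = X @ w + noise_std * rng.standard_normal(ell) # noisy context labels
    x_query = rng.standard_normal(d) / np.sqrt(d)    # query input
    y_query = x_query @ w                            # noiseless target for the query
    return X, y, x_query, y_query


def linear_attention_predict(X, y, x_query, Gamma):
    """Simplified linear-attention readout: x_q^T Gamma (1/ell) sum_i y_i x_i."""
    ell = X.shape[0]
    h = (X.T @ y) / ell                              # label-weighted average of context inputs
    return x_query @ Gamma @ h


def icl_test_error(rng, d, ell, Gamma, n_prompts=2000):
    """Monte Carlo estimate of the squared prediction error on fresh tasks."""
    errs = []
    for _ in range(n_prompts):
        X, y, xq, yq = sample_icl_prompt(rng, d, ell)
        errs.append((linear_attention_predict(X, y, xq, Gamma) - yq) ** 2)
    return float(np.mean(errs))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d = 16
    Gamma = d * np.eye(d)  # untrained stand-in; undoes the 1/d input scaling
    for alpha in (1, 8):
        ell, k, n = proportional_scaling(d, alpha, kappa=2, tau=1)
        err = icl_test_error(rng, d, ell, Gamma)
        print(f"alpha={alpha} ell={ell} k={k} n={n} error={err:.3f}")
```

In this sketch `Gamma` is fixed rather than learned. In the setting the abstract describes, the attention parameters would instead be fit on pretraining prompts drawn from a finite pool of tasks, and the size of that pool is what would separate the memorization regime from the genuine in-context learning regime.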