Large language models can solve tasks that were not present in the training set. This capability is believed to be due to in-context learning and skill composition. In this work, we study the emergence of in-context learning and skill composition in a collection of modular arithmetic tasks. Specifically, we consider a finite collection of linear modular functions $z = a \, x + b \, y \;\mathrm{mod}\; p$ labeled by the vector $(a, b) \in \mathbb{Z}_p^2$. We use some of these tasks for pre-training and the rest for out-of-distribution testing. We empirically show that a GPT-style transformer exhibits a transition from in-distribution to out-of-distribution generalization as the number of pre-training tasks increases. We find that the smallest model capable of out-of-distribution generalization requires two transformer blocks, while for deeper models, the out-of-distribution generalization phase is \emph{transient}, necessitating early stopping. Finally, we perform an interpretability study of the pre-trained models, revealing highly structured representations in both attention heads and MLPs, and we discuss the learned algorithms. Notably, we find an algorithmic shift in deeper models, as we go from few to many in-context examples.
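The task setup above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's actual data pipeline: the function name `make_task_sequence`, the choice $p = 29$, and the pre-training/held-out split size are all illustrative assumptions.

```python
import random

def make_task_sequence(a, b, p, n_examples, rng):
    """Build one in-context sequence for the task (a, b):
    repeated triples (x, y, z) with z = a*x + b*y mod p."""
    seq = []
    for _ in range(n_examples):
        x, y = rng.randrange(p), rng.randrange(p)
        seq.extend([x, y, (a * x + b * y) % p])
    return seq

# Split the p^2 tasks (a, b) into pre-training and held-out (OOD) sets.
p = 29                      # illustrative prime; the paper's p may differ
tasks = [(a, b) for a in range(p) for b in range(p)]
rng = random.Random(42)
rng.shuffle(tasks)
n_pretrain = 500            # illustrative; the paper varies this count
pretrain_tasks, ood_tasks = tasks[:n_pretrain], tasks[n_pretrain:]
```

A model pre-trained on sequences drawn from `pretrain_tasks` is then evaluated on sequences from `ood_tasks`, whose $(a, b)$ labels it never saw during training.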