Language models have been shown to perform better with an increase in scale on a wide variety of tasks via the in-context learning paradigm. In this paper, we investigate the hypothesis that the ability of a large language model to perform a task via in-context learning is not uniformly spread across all of its underlying components. Using a 66 billion parameter language model (OPT-66B) across a diverse set of 14 downstream tasks, we find this is indeed the case: $\sim$70% of attention heads and $\sim$20% of feed-forward networks can be removed with minimal decline in task performance. We find substantial overlap in the set of attention heads (un)important for in-context learning across tasks and across numbers of in-context examples. We also examine our hypothesis through a task-agnostic lens, finding that a small set of attention heads in OPT-66B score highly on their ability to perform primitive induction operations associated with in-context learning, namely, prefix matching and copying. These induction heads overlap with task-specific important heads, reinforcing arguments by Olsson et al. (arXiv:2209.11895) regarding the generality of induction heads to more sophisticated behaviors associated with in-context learning. Overall, our study provides several insights that indicate large language models may be under-trained for in-context learning and opens up questions on how to pre-train language models to more effectively perform in-context learning.
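To make the notion of prefix matching concrete, the following is a minimal sketch of how one might score a single attention head on it. It is not the implementation used in the paper. The function name `prefix_matching_score`, the toy token sequence, and the hand-built attention matrices are all assumptions made for illustration. The idea, following the description of induction heads, is to measure how much attention a query token places on the tokens that immediately followed earlier occurrences of that same token.

```python
from typing import Sequence


def prefix_matching_score(
    attn: Sequence[Sequence[float]], tokens: Sequence[int]
) -> float:
    """Average attention mass placed on tokens that directly followed
    an earlier occurrence of the current query token.

    attn[i][j] is a hypothetical attention weight from query position i
    to key position j for one head; rows are assumed causal and normalized.
    """
    per_position = []
    for i, tok in enumerate(tokens):
        # Key positions j + 1 that come right after an earlier copy of `tok`.
        targets = [j + 1 for j in range(i) if tokens[j] == tok and j + 1 < i]
        if targets:
            per_position.append(sum(attn[i][t] for t in targets))
    # Query positions with no earlier occurrence of their token are skipped.
    return sum(per_position) / len(per_position) if per_position else 0.0


# Toy input (assumption): a 3-token sequence repeated twice.
tokens = [1, 2, 3, 1, 2, 3]
n = len(tokens)

# An idealized induction-like head: in the second repeat, each position
# attends only to the token that followed its earlier occurrence.
ideal = [[1.0 if j == i else 0.0 for j in range(n)] for i in range(n)]
for i in range(3, n):
    ideal[i] = [1.0 if j == i - 2 else 0.0 for j in range(n)]

# A baseline head that spreads attention uniformly over the causal prefix.
uniform = [[1.0 / (i + 1) if j <= i else 0.0 for j in range(n)] for i in range(n)]

ideal_score = prefix_matching_score(ideal, tokens)
uniform_score = prefix_matching_score(uniform, tokens)
```

Under these assumptions the idealized head scores far higher than the uniform baseline. In a real setting, the attention matrices would come from a model run on repeated sequences of random tokens, and scores would be averaged over many such sequences.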