In-context learning (ICL) improves language models' performance on a variety of NLP tasks simply by demonstrating a handful of examples at inference time. It is not well understood why the ICL ability emerges, as the model has never been specifically trained on such demonstrations. Unlike prior work that explores implicit mechanisms behind ICL, we study ICL by investigating the pretraining data. Specifically, we first adapt an iterative, gradient-based approach to find a small subset of pretraining data that supports ICL. We observe that continued pretraining on this small subset significantly improves the model's ICL ability, by up to 18%. We then compare the supportive subset contrastively with random subsets of pretraining data and discover: (1) The pretraining data supportive of ICL do not have a higher domain relevance to downstream tasks. (2) The supportive pretraining data have a higher mass of rarely occurring, long-tail tokens. (3) The supportive pretraining data are challenging examples where the information gain from long-range context is below average, indicating that learning to incorporate difficult long-range context encourages ICL. Our work takes a first step towards understanding ICL by analyzing instance-level pretraining data. Our insights have the potential to enhance the ICL ability of language models by actively guiding the construction of pretraining data in the future.
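To make findings (2) and (3) more concrete, the following is a minimal sketch of how such quantities could be computed for a single pretraining instance. The abstract does not give the exact measures, so everything here is an assumption for illustration: the function names `long_tail_mass` and `long_context_gain`, the `max_count` rarity threshold, and the definition of information gain as the mean per-token difference in log-probability between a long-context and a short-context prediction.

```python
from collections import Counter


def long_tail_mass(doc_tokens, corpus_counts, max_count=1):
    """Fraction of a document's tokens that are rare in the corpus.

    Assumption: a token counts as "long-tail" when its corpus frequency
    is at most `max_count`. This threshold is hypothetical.
    """
    if not doc_tokens:
        return 0.0
    rare = sum(1 for tok in doc_tokens if corpus_counts[tok] <= max_count)
    return rare / len(doc_tokens)


def long_context_gain(logp_short, logp_long):
    """Mean per-token gain in log-probability from using a longer context.

    Assumption: `logp_short[i]` and `logp_long[i]` are the log-probabilities
    a language model assigns to token i given a short and a long context.
    A low value means the long-range context helps little on this instance.
    """
    if len(logp_short) != len(logp_long):
        raise ValueError("log-probability lists must have the same length")
    if not logp_short:
        return 0.0
    total = sum(lo - sh for sh, lo in zip(logp_short, logp_long))
    return total / len(logp_short)


# Toy corpus of pre-tokenized documents (illustrative data only).
corpus = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat"],
    ["the", "quark", "decays"],
]
corpus_counts = Counter(tok for doc in corpus for tok in doc)

for doc in corpus:
    print(doc, round(long_tail_mass(doc, corpus_counts), 3))

# Toy log-probabilities for a two-token instance (illustrative data only).
print(long_context_gain([-2.0, -1.0], [-1.0, -1.0]))
```

Under these assumed definitions, a supportive instance in the sense of the abstract would tend to show a higher `long_tail_mass` and a lower `long_context_gain` than a randomly drawn instance.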