Scaling large language models (LLMs) leads to an emergent capacity to learn in-context from example demonstrations. Despite progress, theoretical understanding of this phenomenon remains limited. We argue that in-context learning relies on recombination of compositional operations found in natural language data. We derive an information-theoretic bound showing how in-context learning abilities arise from generic next-token prediction when the pretraining distribution has sufficient amounts of compositional structure, under linguistically motivated assumptions. A second bound provides a theoretical justification for the empirical success of prompting LLMs to output intermediate steps towards an answer. To validate theoretical predictions, we introduce a controlled setup for inducing in-context learning; unlike previous approaches, it accounts for the compositional nature of language. Trained transformers can perform in-context learning for a range of tasks, in a manner consistent with the theoretical results. Mirroring real-world LLMs in a miniature setup, in-context learning emerges when scaling parameters and data, and models perform better when prompted to output intermediate steps. Probing shows that in-context learning is supported by a representation of the input's compositional structure. Taken together, these results provide a step towards theoretical understanding of emergent behavior in large language models.