Large language models (LLMs) have initiated a paradigm shift in transfer learning. In contrast to the classic pretraining-then-finetuning procedure, using LLMs for downstream prediction tasks requires only a few demonstrations, known as in-context examples, without adding new model parameters or updating existing ones. This in-context learning (ICL) capability of LLMs is intriguing, and it is not yet fully understood how pretrained LLMs acquire it. In this paper, we investigate why a transformer-based language model can perform in-context learning after pretraining on a general language corpus, and we propose the hypothesis that LLMs can simulate kernel regression algorithms when faced with in-context examples. More concretely, we first prove that Bayesian inference on in-context prompts can be asymptotically understood as kernel regression $\hat y = \frac{\sum_i y_i K(x, x_i)}{\sum_i K(x, x_i)}$ as the number of in-context demonstrations grows. Then, we empirically investigate the in-context behaviors of language models and find that, during ICL, the attention and hidden features in LLMs match the behaviors of kernel regression. Finally, our theory provides insights into multiple phenomena observed in the ICL field: why retrieving demonstrations similar to the test sample helps, why ICL performance is sensitive to output formats, and why ICL accuracy benefits from selecting in-distribution and representative samples. We will make our code available to the research community following publication.
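To make the kernel regression estimator $\hat y = \frac{\sum_i y_i K(x, x_i)}{\sum_i K(x, x_i)}$ concrete, the following is a minimal NumPy sketch, not the authors' implementation. The function names (`rbf_kernel`, `exp_dot_kernel`, `kernel_regression`), the Gaussian kernel, the bandwidth, and the toy data are all illustrative assumptions; the abstract does not specify which kernel $K$ arises from the theory. The second half checks a standard identity: with the kernel $K(x, x_i) = \exp(\langle x, x_i \rangle)$, the same estimator coincides with a softmax-weighted average of the labels, which is the form a single attention read-out takes.

```python
import numpy as np


def rbf_kernel(x, xs, bandwidth=1.0):
    """Gaussian similarity between a query x (d,) and each row of xs (n, d).

    The kernel choice and bandwidth are assumptions made for illustration.
    """
    sq_dist = np.sum((xs - x) ** 2, axis=1)
    return np.exp(-sq_dist / (2.0 * bandwidth ** 2))


def exp_dot_kernel(x, xs):
    """Exponentiated dot-product kernel, K(x, x_i) = exp(<x, x_i>)."""
    return np.exp(xs @ x)


def kernel_regression(x, xs, ys, kernel=rbf_kernel):
    """Estimate y_hat = sum_i y_i K(x, x_i) / sum_i K(x, x_i)."""
    weights = kernel(x, xs)            # one similarity per demonstration, shape (n,)
    weights = weights / weights.sum()  # normalize so the weights sum to 1
    return weights @ ys                # weighted average of labels, shape (c,)


# Toy "in-context demonstrations": four 2-D inputs with one-hot labels.
demo_x = np.array([[0.0, 0.0], [0.1, 0.0], [3.0, 3.0], [3.1, 2.9]])
demo_y = np.eye(2)[[0, 0, 1, 1]]
query = np.array([2.8, 3.0])

probs = kernel_regression(query, demo_x, demo_y)
pred = int(np.argmax(probs))  # the query sits near the class-1 demonstrations

# With the exponentiated dot-product kernel, the estimator equals a
# softmax-weighted average of the labels.
scores = demo_x @ query
softmax = np.exp(scores - scores.max())
softmax = softmax / softmax.sum()
attn_out = softmax @ demo_y
kr_out = kernel_regression(query, demo_x, demo_y, kernel=exp_dot_kernel)
```

In this sketch, demonstrations closer to the query receive larger weights, which gives some intuition for why retrieving demonstrations similar to the test sample can help.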