The emergence of In-Context Learning (ICL) in LLMs remains a significant but poorly understood phenomenon. To explain ICL, recent studies have tried to connect it theoretically to Gradient Descent (GD). We ask: does this connection hold up in actual pre-trained models? We highlight limiting assumptions in prior work that make its setting considerably different from the practical setting in which language models are trained. For example, the hand-constructed weights used in these theoretical studies have properties that do not match those of real LLMs. Furthermore, their experimental verification uses an ICL objective (training models explicitly for ICL), which differs from the emergent ICL observed in the wild. We also look for evidence in real models. We observe that ICL and GD differ in their sensitivity to the order in which demonstrations are presented. Finally, we probe and compare the ICL vs. GD hypothesis in a natural setting. We conduct comprehensive empirical analyses on language models pre-trained on natural data (LLaMa-7B). Our comparisons across three performance metrics highlight the inconsistent behavior of ICL and GD as a function of factors such as datasets, models, and the number of demonstrations. We observe that ICL and GD modify the output distribution of language models differently. These results indicate that the equivalence between ICL and GD remains an open hypothesis and call for further studies.
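As a toy illustration of what "sensitivity to the order of demonstrations" can mean on the GD side, the sketch below compares two update rules on a small linear-regression problem. This is not the experimental setup of the paper: the task, the function names `batch_gd_step` and `sequential_sgd`, the learning rate, and the synthetic data are all assumptions made here for illustration. It relies only on the fact that a full-batch gradient averages per-example gradients, so it does not depend on the order of the examples, whereas per-example updates generally do.

```python
import numpy as np

def batch_gd_step(w, X, y, lr=0.05):
    """One full-batch gradient step on mean squared error."""
    grad = 2.0 * X.T @ (X @ w - y) / len(y)
    return w - lr * grad

def sequential_sgd(w, X, y, lr=0.05):
    """One pass of per-example SGD, visiting rows in the given order."""
    for x_i, y_i in zip(X, y):
        w = w - lr * 2.0 * (x_i @ w - y_i) * x_i
    return w

# Synthetic "demonstrations" (hypothetical data, for illustration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=8)
w0 = np.zeros(3)

# Present the same demonstrations in reversed order.
X_rev, y_rev = X[::-1], y[::-1]

# Distance between the weights obtained under the two orderings.
batch_gap = np.linalg.norm(batch_gd_step(w0, X, y) - batch_gd_step(w0, X_rev, y_rev))
seq_gap = np.linalg.norm(sequential_sgd(w0, X, y) - sequential_sgd(w0, X_rev, y_rev))

print(f"full-batch GD order gap:     {batch_gap:.2e}")
print(f"sequential SGD order gap:    {seq_gap:.2e}")
```

The full-batch gap should be zero up to floating-point rounding, while the sequential gap is generally nonzero. An analogous measurement for ICL would compare a model's output distributions under permuted prompts, which requires an actual pre-trained model and is not sketched here.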