Is In-Context Learning (ICL) implicitly equivalent to Gradient Descent (GD)? Several recent works draw analogies between the dynamics of GD and the emergent behavior of ICL in large language models. However, these works make assumptions that are far from the realistic natural language setting in which language models are trained. Such discrepancies between theory and practice necessitate further investigation to validate the applicability of these analogies. We start by highlighting the weaknesses in prior works that construct Transformer weights to simulate gradient descent. Their experiments with training Transformers on an ICL objective, inconsistencies in the order sensitivity of ICL and GD, sparsity of the constructed weights, and sensitivity to parameter changes are some examples of a mismatch with the real-world setting. Furthermore, we probe and compare the ICL vs. GD hypothesis in a natural setting. We conduct comprehensive empirical analyses on language models pretrained on natural data (LLaMa-7B). Our comparisons on various performance metrics highlight the inconsistent behavior of ICL and GD as a function of factors such as datasets, models, and the number of demonstrations. We observe that ICL and GD adapt the output distribution of language models differently. These results indicate that the equivalence between ICL and GD remains an open hypothesis that requires nuanced consideration and calls for further study.