Large language models (LLMs) such as transformers demonstrate impressive in-context learning (ICL) capabilities, allowing them to make predictions for new tasks from prompt exemplars without parameter updates. While existing ICL theories often assume structured training data resembling ICL tasks (e.g., x-y pairs for linear regression), LLMs are typically trained unsupervised on unstructured text, such as web content, which lacks clear parallels to tasks like word analogy. To address this gap, we examine what enables ICL in models trained on unstructured data, focusing on critical sequence-model requirements and training-data structure. We find that many ICL capabilities can emerge simply from the co-occurrence of semantically related word pairs in unstructured data; word analogy completion, for example, can provably arise purely from co-occurrence modeling with classical language models such as continuous bag of words (CBOW), without positional information or attention mechanisms. However, positional information becomes crucial for logical reasoning tasks that require generalization to unseen tokens. Finally, we identify two failure cases for ICL: logical reasoning tasks that require generalizing to new, unseen patterns, and analogy completion where the relevant word pairs appear only in fixed positions in the training data. These findings suggest that LLMs' ICL abilities depend heavily on the structural elements of their training data.
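As an illustrative sketch of the co-occurrence mechanism described above, analogy completion with CBOW-style embeddings is typically performed via vector offsets: the answer to "a : b :: c : ?" is the vocabulary word closest to b − a + c. The embedding values and the `analogy` helper below are hypothetical toy constructs for illustration, not the paper's actual models or data.

```python
import numpy as np

# Toy embedding table standing in for CBOW vectors learned from
# co-occurrence statistics (hand-picked hypothetical values).
emb = {
    "man":   np.array([1.0, 0.0, 0.2]),
    "woman": np.array([1.0, 1.0, 0.2]),
    "king":  np.array([1.0, 0.0, 1.0]),
    "queen": np.array([1.0, 1.0, 1.0]),
}

def analogy(a, b, c, emb):
    """Complete 'a : b :: c : ?' by nearest neighbor (cosine similarity)
    to the offset vector b - a + c, excluding the three cue words."""
    target = emb[b] - emb[a] + emb[c]
    best, best_sim = None, -np.inf
    for word, vec in emb.items():
        if word in (a, b, c):
            continue
        sim = vec @ target / (np.linalg.norm(vec) * np.linalg.norm(target))
        if sim > best_sim:
            best, best_sim = word, sim
    return best

print(analogy("man", "woman", "king", emb))  # prints "queen"
```

Note that this computation uses only the learned word vectors; no positional encoding or attention is involved, which is the point of the CBOW result in the abstract.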