Pretrained language models (LMs) can generalize to implications of facts that they are finetuned on. For example, if finetuned on ``John Doe lives in Tokyo,'' LMs can correctly answer ``What language do the people in John Doe's city speak?'' with ``Japanese.'' However, little is known about the mechanisms that enable this generalization or how they are learned during pretraining. We introduce extractive structures as a framework for describing how components in LMs (e.g., MLPs or attention heads) coordinate to enable this generalization. The structures consist of informative components that store training facts as weight changes, and upstream and downstream extractive components that query and process the stored information to produce the correct implication. We hypothesize that extractive structures are learned during pretraining when the model encounters implications of previously known facts. This yields two predictions: a data ordering effect, where extractive structures can be learned only if facts precede their implications, and a weight grafting effect, where extractive structures can be transferred to predict counterfactual implications. We empirically demonstrate these phenomena in the OLMo-7b, Llama 3-8b, Gemma 2-9b, and Qwen 2-7b models. Of independent interest, our results also indicate that fact learning can occur at both early and late layers, leading to different forms of generalization.