Large language models (LLMs) can acquire new knowledge through fine-tuning, but this process exhibits a puzzling duality: models can generalize remarkably well from new facts, yet are also prone to hallucinating incorrect information. The reasons for this duality remain poorly understood. In this work, we argue that both behaviors stem from a single mechanism known as out-of-context reasoning (OCR): the ability to deduce implications by associating concepts, even those without a causal link. Our experiments across five prominent LLMs confirm that OCR indeed drives both generalization and hallucination, with the outcome depending on whether the associated concepts are causally related. To build a rigorous theoretical understanding of this phenomenon, we then formalize OCR as a synthetic factual recall task. We empirically show that a one-layer, single-head, attention-only transformer with factorized output and value matrices can learn to solve this task, whereas a model with combined weights cannot, highlighting the crucial role of matrix factorization. Our theoretical analysis shows that the OCR capability can be attributed to the implicit bias of gradient descent, which favors solutions that minimize the nuclear norm of the combined output-value matrix. This mathematical structure explains why the model learns to associate facts and their implications with high sample efficiency, regardless of whether the correlation is causal or merely spurious. Ultimately, our work provides a theoretical foundation for understanding the OCR phenomenon and offers a new lens for analyzing and mitigating the undesirable behaviors that arise from knowledge injection.
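To make the architectural contrast concrete, below is a minimal PyTorch sketch of the two model variants, not the paper's exact implementation: one parameterizes the value and output projections separately (W_V and W_O, the factorized model), while the other trains a single combined matrix W_OV in their place. The class name, dimensions, scaling, and the absence of causal masking are illustrative assumptions.

```python
import torch
import torch.nn as nn

class OneLayerAttnOnly(nn.Module):
    """Minimal one-layer, single-head, attention-only transformer.

    factorized=True keeps separate value (W_V) and output (W_O) matrices;
    factorized=False trains one combined matrix W_OV in their place.
    All names and dimensions here are illustrative assumptions.
    """
    def __init__(self, vocab_size: int, d_model: int, factorized: bool = True):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.W_Q = nn.Linear(d_model, d_model, bias=False)
        self.W_K = nn.Linear(d_model, d_model, bias=False)
        self.factorized = factorized
        if factorized:
            self.W_V = nn.Linear(d_model, d_model, bias=False)
            self.W_O = nn.Linear(d_model, d_model, bias=False)
        else:
            self.W_OV = nn.Linear(d_model, d_model, bias=False)
        self.unembed = nn.Linear(d_model, vocab_size, bias=False)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        x = self.embed(tokens)                               # (batch, seq, d_model)
        scores = self.W_Q(x) @ self.W_K(x).transpose(-2, -1)
        attn = torch.softmax(scores / x.shape[-1] ** 0.5, dim=-1)
        # The only difference between the two variants is this projection:
        # a product of two trained matrices vs. one trained matrix.
        v = self.W_O(self.W_V(x)) if self.factorized else self.W_OV(x)
        return self.unembed(x + attn @ v)                    # next-token logits
```

Note that the two variants are expressively identical, since W_O W_V and W_OV can represent the same set of maps; per the abstract, what differs is how gradient descent behaves on the factorized product versus the single matrix.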
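The nuclear-norm claim can be stated compactly. The sketch below uses a standard variational characterization of the nuclear norm (a well-known linear-algebra fact, not specific to this paper) to indicate why gradient descent on the factorized weights is biased toward low-nuclear-norm combined matrices; the constraint wording is an informal paraphrase of the abstract's claim.

```latex
% Nuclear norm of the combined output-value matrix:
% the sum of its singular values.
\|W_{OV}\|_* \;=\; \sum_i \sigma_i(W_{OV}),
\qquad W_{OV} \;=\; W_O W_V.

% Standard variational characterization: over all factorizations of a
% fixed W_{OV}, the smallest average of the factors' squared Frobenius
% norms equals the nuclear norm of the product.
\|W_{OV}\|_* \;=\; \min_{W_O W_V \,=\, W_{OV}}
  \tfrac{1}{2}\bigl( \|W_O\|_F^2 + \|W_V\|_F^2 \bigr).

% Informally, among all combined matrices consistent with the
% fine-tuning data, the factorized training dynamics favor
\min_{W_{OV}} \|W_{OV}\|_*
\quad \text{subject to fitting the training facts.}
```

Because small Frobenius norms of the factors correspond to a small nuclear norm of their product, keeping W_O and W_V as separate trained matrices is what gives gradient descent this low-rank-favoring bias, which the combined-weight model lacks.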