Large-scale neural language models exhibit remarkable in-context learning performance: the ability to learn and reason about the input context on the fly. This work studies in-context counterfactual reasoning in language models, that is, the ability to predict the consequences of a hypothetical scenario. We focus on a well-defined synthetic linear regression task that requires noise abduction. Accurate prediction rests on (1) inferring an unobserved latent concept and (2) copying contextual noise from the factual observations. We show that language models are capable of such counterfactual reasoning. Further, we strengthen existing identifiability results and reduce counterfactual reasoning for a broad class of functions to a transformation on in-context observations. In Transformers, we find that self-attention, model depth, and pre-training data diversity drive performance. Moreover, we provide mechanistic evidence that the latent concept is linearly represented in the residual stream, and we identify dedicated \textit{noise abduction heads} that are central to performing counterfactual reasoning. Lastly, our findings extend to counterfactual reasoning under SDE dynamics and show that Transformers can perform noise abduction on sequential data, providing preliminary evidence of the potential for counterfactual story generation. Our code is available at https://github.com/mrtzmllr/iccr.
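As a minimal sketch of the task structure the abstract describes (infer a latent concept, abduct the contextual noise, then predict under a hypothetical scenario), the following assumes a linear Gaussian setting and Pearl's abduction-action-prediction recipe; the variable names and the least-squares inference step are illustrative choices, not necessarily the paper's exact setup.

```python
import numpy as np

rng = np.random.default_rng(0)

# Factual data: y_i = w^T x_i + eps_i with an unobserved latent concept w.
n, d = 32, 4
w = rng.normal(size=d)           # latent concept (unknown to the reasoner)
X = rng.normal(size=(n, d))      # in-context inputs
eps = 0.1 * rng.normal(size=n)   # contextual noise, fixed per observation
y = X @ w + eps                  # factual in-context observations

# Step 1 (abstract's point 1): infer the latent concept from context.
w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

# Step 2 (abstract's point 2): abduct the noise from factual observations.
eps_hat = y - X @ w_hat

# Step 3: predict under a hypothetical latent concept w -> w_cf,
# copying the abducted noise into the counterfactual world.
w_cf = rng.normal(size=d)        # hypothetical scenario
y_cf_pred = X @ w_cf + eps_hat

# The true counterfactual reuses the actual noise realization.
y_cf_true = X @ w_cf + eps
print(f"max abs error: {np.max(np.abs(y_cf_pred - y_cf_true)):.2e}")
```

With enough context points, the abducted noise matches the true noise up to estimation error, so the counterfactual prediction is a deterministic transformation of the in-context observations, as the abstract's reduction suggests.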