Causal Reasoning and Large Language Models: Opening a New Frontier for Causality

The causal capabilities of large language models (LLMs) are a matter of significant debate, with critical implications for the use of LLMs in societally impactful domains such as medicine, science, law, and policy. We further our understanding of LLMs and their causal implications, considering the distinctions between different types of causal reasoning tasks, as well as the entangled threats of construct and measurement validity. LLM-based methods establish new state-of-the-art accuracies on multiple causal benchmarks. Algorithms based on GPT-3.5 and GPT-4 outperform existing algorithms on a pairwise causal discovery task (97%, a 13-point gain), a counterfactual reasoning task (92%, a 20-point gain), and an actual causality task (86% accuracy in determining necessary and sufficient causes in vignettes). At the same time, LLMs exhibit unpredictable failure modes, and we provide some techniques to interpret their robustness. Crucially, LLMs perform these causal tasks while relying on sources of knowledge and methods distinct from and complementary to non-LLM-based approaches. Specifically, LLMs bring capabilities so far understood to be restricted to humans, such as using collected knowledge to generate causal graphs or identifying background causal context from natural language. We envision LLMs being used alongside existing causal methods, as a proxy for human domain knowledge and to reduce human effort in setting up a causal analysis, one of the biggest impediments to the widespread adoption of causal methods. We also see existing causal methods as promising tools for LLMs to formalize, validate, and communicate their reasoning, especially in high-stakes scenarios. In capturing common sense and domain knowledge about causal mechanisms and supporting translation between natural language and formal methods, LLMs open new frontiers for advancing the research, practice, and adoption of causality.
