Causal Reasoning and Large Language Models: Opening a New Frontier for Causality

The causal capabilities of large language models (LLMs) are a matter of significant debate, with critical implications for the use of LLMs in societally impactful domains such as medicine, science, law, and policy. We further our understanding of LLMs and their causal implications, considering the distinctions between different types of causal reasoning tasks, as well as the entangled threats of construct and measurement validity. LLM-based methods establish new state-of-the-art accuracies on multiple causal benchmarks. Algorithms based on GPT-3.5 and GPT-4 outperform existing algorithms on a pairwise causal discovery task (97%, a 13-point gain), a counterfactual reasoning task (92%, a 20-point gain), and an actual causality task (86% accuracy in determining necessary and sufficient causes in vignettes). At the same time, LLMs exhibit unpredictable failure modes, and we provide some techniques to interpret their robustness. Crucially, LLMs perform these causal tasks while relying on sources of knowledge and methods that are distinct from, and complementary to, non-LLM-based approaches. Specifically, LLMs bring capabilities so far understood to be restricted to humans, such as using collected knowledge to generate causal graphs or identifying background causal context from natural language. We envision LLMs being used alongside existing causal methods, as a proxy for human domain knowledge and as a way to reduce human effort in setting up a causal analysis, which is one of the biggest impediments to the widespread adoption of causal methods. We also see existing causal methods as promising tools for LLMs to formalize, validate, and communicate their reasoning, especially in high-stakes scenarios. In capturing common sense and domain knowledge about causal mechanisms and supporting translation between natural language and formal methods, LLMs open new frontiers for advancing the research, practice, and adoption of causality.
