The ability to perform causal reasoning is widely considered a core feature of intelligence. In this work, we investigate whether large language models (LLMs) can coherently reason about causality. Much of the existing work in natural language processing (NLP) focuses on evaluating commonsense causal reasoning in LLMs, and therefore does not assess whether a model can perform causal inference in accordance with a set of well-defined formal rules. To address this, we propose a new NLP task, causal inference in natural language, inspired by the "causal inference engine" postulated by Judea Pearl et al. We compose a large dataset, CLadder, with 10K samples: starting from a collection of causal graphs and queries (associational, interventional, and counterfactual), we obtain symbolic questions and ground-truth answers through an oracle causal inference engine, and then translate them into natural language. We evaluate multiple LLMs on our dataset, and we introduce and evaluate a bespoke chain-of-thought prompting strategy, CausalCoT. We show that our task is highly challenging for LLMs, and we conduct an in-depth analysis to gain deeper insights into the causal reasoning abilities of LLMs. Our data is open-sourced at https://huggingface.co/datasets/causalNLP/cladder, and our code can be found at https://github.com/causalNLP/cladder.
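The generation pipeline described in the abstract (a causal graph and a query, a ground-truth answer from an oracle, then a natural-language rendering) can be illustrated with a minimal sketch. This is not the authors' engine or the CLadder code: the class `ConfoundingModel`, the helpers `ate` and `make_sample`, the question wording, and all probabilities are hypothetical, chosen only to show how an associational and an interventional query on the same graph can give different answers.

```python
from dataclasses import dataclass


@dataclass
class ConfoundingModel:
    """Binary model for the graph Z -> X, Z -> Y, X -> Y (hypothetical example)."""
    p_z: float            # P(Z=1)
    p_x_given_z: dict     # P(X=1 | Z=z), keyed by z
    p_y_given_xz: dict    # P(Y=1 | X=x, Z=z), keyed by (x, z)

    def _pz(self, z):
        return self.p_z if z == 1 else 1.0 - self.p_z

    def _px(self, x, z):
        p = self.p_x_given_z[z]
        return p if x == 1 else 1.0 - p

    def p_y_given_x(self, x):
        """Associational query: P(Y=1 | X=x), obtained by conditioning."""
        num = sum(self.p_y_given_xz[(x, z)] * self._px(x, z) * self._pz(z)
                  for z in (0, 1))
        den = sum(self._px(x, z) * self._pz(z) for z in (0, 1))
        return num / den

    def p_y_do_x(self, x):
        """Interventional query: P(Y=1 | do(X=x)) via backdoor adjustment over Z."""
        return sum(self.p_y_given_xz[(x, z)] * self._pz(z) for z in (0, 1))


def ate(model):
    """Average treatment effect: P(Y=1 | do(X=1)) - P(Y=1 | do(X=0))."""
    return model.p_y_do_x(1) - model.p_y_do_x(0)


def make_sample(model, treatment="the treatment", outcome="recovery"):
    """Render the symbolic query as a yes/no question with its oracle answer."""
    effect = ate(model)
    question = (
        f"Suppose we intervene so that everyone receives {treatment}. "
        f"Will {outcome} be more likely than if no one receives it?"
    )
    return {"question": question, "ate": effect,
            "answer": "yes" if effect > 0 else "no"}


# Made-up parameters for illustration only.
model = ConfoundingModel(
    p_z=0.5,
    p_x_given_z={0: 0.2, 1: 0.8},
    p_y_given_xz={(0, 0): 0.1, (1, 0): 0.3, (0, 1): 0.6, (1, 1): 0.8},
)
sample = make_sample(model)
assoc_diff = model.p_y_given_x(1) - model.p_y_given_x(0)

print(sample["question"])
print("oracle answer:", sample["answer"])
print("associational difference:", round(assoc_diff, 3))
print("interventional effect (ATE):", round(sample["ate"], 3))
```

With these parameters the conditional difference overstates the causal effect because Z influences both X and Y. That gap between associational and interventional quantities is the kind of distinction a formal causal inference task is meant to probe.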