The ability to perform causal reasoning is widely considered a core feature of intelligence. In this work, we investigate whether large language models (LLMs) can coherently reason about causality. Much of the existing work in natural language processing (NLP) focuses on evaluating commonsense causal reasoning in LLMs, thus failing to assess whether a model can perform causal inference in accordance with a set of well-defined formal rules. To address this, we propose a new NLP task, causal inference in natural language, inspired by the "causal inference engine" postulated by Judea Pearl et al. We compose a large dataset, CLadder, with 10K samples: based on a collection of causal graphs and queries (associational, interventional, and counterfactual), we obtain symbolic questions and ground-truth answers, through an oracle causal inference engine. These are then translated into natural language. We evaluate multiple LLMs on our dataset, and we introduce and evaluate a bespoke chain-of-thought prompting strategy, CausalCoT. We show that our task is highly challenging for LLMs, and we conduct an in-depth analysis to gain deeper insight into the causal reasoning abilities of LLMs. Our data is open-sourced at https://huggingface.co/datasets/causalNLP/cladder, and our code can be found at https://github.com/causalNLP/cladder.
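To make the distinction between associational and interventional queries concrete, the following is a minimal sketch of how an oracle could compute ground-truth answers on a single causal graph. It is an illustration only, not the CLadder generation code: the confounding graph (Z -> X, Z -> Y, X -> Y), all probability values, and the helper names (`bern`, `joint`, `p_y1_given_x`, `p_y1_do_x`, `answer`) are assumptions chosen for this example. The interventional quantity is obtained with the standard backdoor adjustment over Z.

```python
from itertools import product

# Hypothetical parameters for a binary confounding graph: Z -> X, Z -> Y, X -> Y.
# None of these numbers come from the dataset; they are chosen for illustration.
P_Z1 = 0.5                                   # P(Z=1)
P_X1_GIVEN_Z = {0: 0.2, 1: 0.8}              # P(X=1 | Z=z)
P_Y1_GIVEN_XZ = {(0, 0): 0.1, (1, 0): 0.3,   # P(Y=1 | X=x, Z=z), keyed by (x, z)
                 (0, 1): 0.6, (1, 1): 0.8}


def bern(p1, value):
    """Probability of `value` (0 or 1) under a Bernoulli with P(1) = p1."""
    return p1 if value == 1 else 1.0 - p1


def joint(x, y, z):
    """Joint probability P(X=x, Y=y, Z=z) from the graph factorization."""
    return (bern(P_Z1, z)
            * bern(P_X1_GIVEN_Z[z], x)
            * bern(P_Y1_GIVEN_XZ[(x, z)], y))


def p_y1_given_x(x):
    """Associational query: P(Y=1 | X=x), by conditioning on the joint."""
    numerator = sum(joint(x, 1, z) for z in (0, 1))
    denominator = sum(joint(x, y, z) for y, z in product((0, 1), repeat=2))
    return numerator / denominator


def p_y1_do_x(x):
    """Interventional query: P(Y=1 | do(X=x)), by backdoor adjustment over Z."""
    return sum(bern(P_Z1, z) * P_Y1_GIVEN_XZ[(x, z)] for z in (0, 1))


def answer(effect):
    """Map a signed effect to a binary yes/no label."""
    return "yes" if effect > 0 else "no"


if __name__ == "__main__":
    assoc_gap = p_y1_given_x(1) - p_y1_given_x(0)   # observed difference
    ate = p_y1_do_x(1) - p_y1_do_x(0)               # average treatment effect
    print(f"P(Y=1|X=1) - P(Y=1|X=0)         = {assoc_gap:.2f}")
    print(f"P(Y=1|do(X=1)) - P(Y=1|do(X=0)) = {ate:.2f}")
    print("Does X increase Y under intervention?", answer(ate))
```

With these assumed parameters, the observed difference (0.50) is larger than the average treatment effect (0.20) because Z confounds X and Y. A model that answers from the conditional probabilities alone would overstate the causal effect, which is the kind of error a formal causal inference benchmark is meant to expose.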