Can Large Language Models Infer Causation from Correlation?

Causal inference is one of the hallmarks of human intelligence. While the field of CausalNLP has attracted much interest in recent years, existing causal inference datasets in NLP primarily rely on discovering causality from empirical knowledge (e.g., commonsense knowledge). In this work, we propose the first benchmark dataset to test the pure causal inference skills of large language models (LLMs). Specifically, we formulate a novel task, Corr2Cause, which takes a set of correlational statements and determines the causal relationship between the variables. We curate a large-scale dataset of more than 200K samples, on which we evaluate seventeen existing LLMs. Through our experiments, we identify a key shortcoming of LLMs in their causal inference skills and show that these models perform close to random on the task. This shortcoming is somewhat mitigated when we re-purpose LLMs for this skill via finetuning, but we find that these models still fail to generalize: they can perform causal inference only in in-distribution settings, where the variable names and textual expressions used in the queries are similar to those in the training set, and they fail in out-of-distribution settings generated by perturbing these queries. Corr2Cause is a challenging task for LLMs and should help guide future research on improving LLMs' pure reasoning skills and generalizability. Our data is available at https://huggingface.co/datasets/causalnlp/corr2cause. Our code is available at https://github.com/causalNLP/corr2cause.
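To make the task concrete, here is a minimal sketch of why a set of correlational statements may or may not pin down a causal claim. It is not the authors' data-generation pipeline. It assumes three hypothetical variables A, B, and C, treats everything observable from correlations and conditional independences as the Markov equivalence class of the underlying graph (an assumption that relies on faithfulness), and uses the standard result that two directed acyclic graphs are Markov equivalent exactly when they share the same skeleton and the same unshielded colliders. The function names are illustrative only.

```python
from itertools import combinations, product

NODES = ("A", "B", "C")  # hypothetical variable names, not from the dataset


def is_acyclic(nodes, edges):
    """Return True if the directed edges contain no cycle."""
    remaining, edges = set(nodes), set(edges)
    while remaining:
        # Nodes with no incoming edge can be removed safely.
        sources = {n for n in remaining if all(v != n for _, v in edges)}
        if not sources:
            return False  # every remaining node has a parent, so there is a cycle
        remaining -= sources
        edges = {(u, v) for u, v in edges if u in remaining and v in remaining}
    return True


def all_dags(nodes):
    """Enumerate every DAG over the nodes as a frozenset of (parent, child) edges."""
    pairs = list(combinations(nodes, 2))
    dags = []
    # For each pair: 0 = no edge, 1 = first -> second, 2 = second -> first.
    for choice in product((0, 1, 2), repeat=len(pairs)):
        edges = set()
        for (u, v), c in zip(pairs, choice):
            if c == 1:
                edges.add((u, v))
            elif c == 2:
                edges.add((v, u))
        if is_acyclic(nodes, edges):
            dags.append(frozenset(edges))
    return dags


def signature(edges):
    """Skeleton plus unshielded colliders; equal signatures mean Markov equivalence."""
    skeleton = frozenset(frozenset(e) for e in edges)
    colliders = frozenset(
        (u, w, v)
        for u, w in edges
        for v, w2 in edges
        if w == w2 and u < v and frozenset((u, v)) not in skeleton
    )
    return skeleton, colliders


def entails_direct_cause(dag, cause, effect, nodes=NODES):
    """True if cause -> effect appears in every DAG that looks the same observationally."""
    target = signature(dag)
    equivalent = [g for g in all_dags(nodes) if signature(g) == target]
    return all((cause, effect) in g for g in equivalent)


chain = frozenset({("A", "C"), ("C", "B")})     # A -> C -> B
collider = frozenset({("A", "C"), ("B", "C")})  # A -> C <- B

print(entails_direct_cause(chain, "A", "C"))     # False
print(entails_direct_cause(collider, "A", "C"))  # True
```

The chain shares its observable pattern with the fork A <- C -> B and the reversed chain, so "A directly causes C" cannot be concluded from it, whereas the collider is the only graph with its pattern, so the same hypothesis holds there.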
