Can Large Language Models Infer Causation from Correlation?

Causal inference is one of the hallmarks of human intelligence. While the field of CausalNLP has attracted much interest in recent years, existing causal inference datasets in NLP primarily rely on discovering causality from empirical knowledge (e.g., commonsense knowledge). In this work, we propose the first benchmark dataset to test the pure causal inference skills of large language models (LLMs). Specifically, we formulate a novel task, Corr2Cause, which takes a set of correlational statements as input and asks for the causal relationship between the variables. We curate a large-scale dataset of more than 400K samples, on which we evaluate seventeen existing LLMs. Through our experiments, we identify a key shortcoming of LLMs in their causal inference skills, and show that these models perform close to random on the task. This shortcoming is somewhat mitigated when we re-purpose LLMs for this skill via finetuning, but we find that these models still fail to generalize: they can perform causal inference only in in-distribution settings, where the variable names and textual expressions used in the queries are similar to those in the training set, and they fail in out-of-distribution settings generated by perturbing these queries. Corr2Cause is a challenging task for LLMs, and can help guide future research on improving LLMs' pure reasoning skills and generalizability. Our data is at https://huggingface.co/datasets/causalnlp/corr2cause. Our code is at https://github.com/causalNLP/corr2cause.
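
To make the task description above concrete, the following is a minimal sketch of how a causal claim can follow from purely correlational statements. It is not taken from the paper or its code. It assumes the standard collider (v-structure) rule from constraint-based causal discovery, and it further assumes faithfulness and no hidden confounders. The function names `colliders` and `entails_direct_cause` and the input encoding are hypothetical, chosen only for illustration.

```python
from itertools import combinations


def colliders(adjacent, sepsets):
    """Find v-structures a -> c <- b (illustrative helper, not from the paper).

    adjacent: set of frozenset pairs that stay dependent under every
              conditioning set (the skeleton edges).
    sepsets:  dict mapping a non-adjacent frozenset pair to the set of
              variables that renders the pair independent.
    """
    nodes = sorted({v for pair in adjacent for v in pair})
    found = set()
    for a, b in combinations(nodes, 2):
        pair = frozenset((a, b))
        if pair in adjacent or pair not in sepsets:
            continue  # a and b are adjacent, or no independence was stated
        for c in nodes:
            if c in (a, b):
                continue
            both_edges = (frozenset((a, c)) in adjacent
                          and frozenset((b, c)) in adjacent)
            # Unshielded triple a - c - b with c outside the separating set
            if both_edges and c not in sepsets[pair]:
                found.add((a, c, b))
    return found


def entails_direct_cause(cause, effect, adjacent, sepsets):
    """True if the collider rule alone orients an edge cause -> effect."""
    return any(effect == c and cause in (a, b)
               for a, c, b in colliders(adjacent, sepsets))


# Premise 1: A correlates with C, B correlates with C, A is independent of B.
collider_adjacent = {frozenset("AC"), frozenset("BC")}
collider_sepsets = {frozenset("AB"): set()}

# Premise 2: A correlates with B, B correlates with C,
# and A is independent of C given B.
chain_adjacent = {frozenset("AB"), frozenset("BC")}
chain_sepsets = {frozenset("AC"): {"B"}}

print(entails_direct_cause("A", "C", collider_adjacent, collider_sepsets))
print(entails_direct_cause("A", "B", chain_adjacent, chain_sepsets))
```

Under these assumptions, the first premise supports the hypothesis "A directly causes C", because C can only be a common effect of A and B. The second premise does not support "A directly causes B", since a chain in either direction and a fork at B all imply the same independence statements.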
