In causal inference, generalization capability refers to the ability to apply causal inference methods to new data in order to estimate causal effects between previously unknown phenomena, which is crucial for expanding the boundaries of knowledge. Studies have evaluated the causal inference capabilities of Large Language Models (LLMs) on known phenomena, yet the generalization capabilities of LLMs on unseen phenomena remain unexplored. In this paper, we select four tasks as representatives of causal inference tasks: Causal Path Discovery (CP), Backdoor Adjustment (BA), Factual Inference (FI), and Counterfactual Inference (CI). To generate evaluation questions about previously unseen phenomena for these four tasks, we propose a benchmark generation framework that employs randomly generated graphs and node names to formulate questions within hypothetical new causal scenarios. Based on this framework, we compile a benchmark dataset with varying levels of question complexity. We extensively test the generalization capabilities of five leading LLMs across the four tasks. Experimental results reveal that while LLMs exhibit good generalization performance on simple CP, FI, and complex CI questions, they encounter difficulties on BA questions and show pronounced performance fluctuations as problem complexity changes. Furthermore, when the names of phenomena incorporate existing terms, even if the names themselves are entirely novel, generalization performance can still be hindered by interference from the familiar terms.
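The framework is only summarized above; as a rough illustration of the idea, the following is a minimal Python sketch of one possible instantiation: sample a random DAG, invent novel node names from random letters so that no prior knowledge applies, and render a CP-style question. The function names (random_dag, random_node_names, causal_path_question) and all details are hypothetical, not the paper's actual implementation.

```python
import random
import string

def random_dag(num_nodes, edge_prob, rng):
    """Sample a random DAG by allowing edges only from lower to higher index."""
    return [(i, j)
            for i in range(num_nodes)
            for j in range(i + 1, num_nodes)
            if rng.random() < edge_prob]

def random_node_names(num_nodes, length, rng):
    """Invent entirely novel phenomenon names from random letters."""
    return ["".join(rng.choices(string.ascii_uppercase, k=length))
            for _ in range(num_nodes)]

def causal_path_question(names, edges):
    """Render a Causal Path Discovery (CP) question for a hypothetical scenario."""
    facts = "; ".join(f"{names[i]} causes {names[j]}" for i, j in edges)
    src, dst = names[0], names[-1]
    return (f"In a hypothetical system, {facts}. "
            f"Is there a directed causal path from {src} to {dst}?")

rng = random.Random(0)
edges = random_dag(5, 0.4, rng)          # graph structure controls question complexity
names = random_node_names(5, 6, rng)     # random names ensure the phenomena are unseen
print(causal_path_question(names, edges))
```

Varying the number of nodes and the edge probability would yield questions of different complexity levels, in the spirit of the benchmark described above.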