The ability to understand causality significantly impacts the competence of large language models (LLMs) in output explanation and counterfactual reasoning, as causality reveals the underlying data distribution. However, the lack of a comprehensive benchmark currently limits the evaluation of LLMs' causal learning capabilities. To fill this gap, this paper develops CausalBench based on data from the causal research community, enabling comparative evaluations of LLMs against traditional causal learning algorithms. To provide a comprehensive investigation, we offer three tasks of varying difficulty: correlation, causal skeleton, and causality identification. Evaluations of 19 leading LLMs reveal that, while closed-source LLMs show potential on simple causal relationships, they lag significantly behind traditional algorithms on larger-scale networks ($>50$ nodes). Specifically, LLMs struggle with collider structures but excel at chain structures, especially long-chain causality analogous to Chain-of-Thought techniques. This supports current prompting approaches while suggesting directions for enhancing LLMs' causal reasoning capability. Furthermore, CausalBench incorporates background knowledge and training data into prompts to thoroughly unlock LLMs' text-comprehension ability during evaluation; the findings indicate that LLMs understand causality through semantic associations with distinct entities, rather than directly from contextual information or numerical distributions.