Causal Evaluation of Language Models

Causal reasoning is viewed as crucial for achieving human-level machine intelligence. Recent advances in language models have expanded the horizons of artificial intelligence across various domains, prompting inquiry into their potential for causal reasoning. In this work, we introduce Causal evaluation of Language Models (CaLM), which, to the best of our knowledge, is the first comprehensive benchmark for evaluating the causal reasoning capabilities of language models. First, we propose the CaLM framework, which establishes a foundational taxonomy consisting of four modules: causal target (i.e., what to evaluate), adaptation (i.e., how to obtain the results), metric (i.e., how to measure the results), and error (i.e., how to analyze the bad results). This taxonomy defines a broad evaluation design space while systematically selecting criteria and priorities. Second, we compose the CaLM dataset, comprising 126,334 data samples, to provide curated sets of causal targets, adaptations, metrics, and errors, offering extensive coverage for diverse research pursuits. Third, we conduct an extensive evaluation of 28 leading language models on a core set of 92 causal targets, 9 adaptations, 7 metrics, and 12 error types. Fourth, we perform detailed analyses of the evaluation results across various dimensions (e.g., adaptation, scale). Fifth, we present 50 high-level empirical findings across 9 dimensions (e.g., model), providing valuable guidance for future language model development. Finally, we develop a multifaceted platform, including a website, leaderboards, datasets, and toolkits, to support scalable and adaptable assessments. We envision CaLM as an ever-evolving benchmark for the community, systematically updated with new causal targets, adaptations, models, metrics, and error types to reflect ongoing research advancements. The project website is at https://opencausalab.github.io/CaLM.
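To make the four-module taxonomy concrete, the following is a minimal sketch of how an evaluation design space of this shape could be represented. It is an illustration under stated assumptions, not the CaLM toolkit's actual API: the names `EvalSetting`, `design_space`, and `tally_errors`, as well as all target, adaptation, metric, and error labels, are hypothetical placeholders, and the exhaustive cross product is a simplification rather than a claim about how the core set is selected.

```python
from collections import Counter
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class EvalSetting:
    """One point in the design space (hypothetical structure, not CaLM's API)."""
    causal_target: str  # what to evaluate
    adaptation: str     # how to obtain the results
    metric: str         # how to measure the results


def design_space(targets, adaptations, metrics):
    """Enumerate every (causal target, adaptation, metric) combination."""
    return [
        EvalSetting(t, a, m)
        for t, a, m in product(targets, adaptations, metrics)
    ]


def tally_errors(error_labels):
    """Error module: count how often each error type occurs among bad results."""
    return Counter(error_labels)


if __name__ == "__main__":
    # Placeholder labels only; the real benchmark defines its own sets.
    settings = design_space(
        targets=["target_A", "target_B"],
        adaptations=["adaptation_1", "adaptation_2", "adaptation_3"],
        metrics=["metric_X"],
    )
    print(len(settings))  # 2 * 3 * 1 = 6 settings
    print(tally_errors(["error_1", "error_2", "error_1"]))
```

Keeping the error tally separate from the setting itself mirrors the abstract's distinction between measuring results (metric) and analyzing bad results (error).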
