EconCausal: A Context-Aware Causal Reasoning Benchmark for Large Language Models in Social Science

Socio-economic causal effects depend heavily on their specific institutional and environmental context. A single intervention can produce opposite results depending on regulatory or market factors, contexts that are often complex and only partially observed. This poses a significant challenge for large language models (LLMs) in decision-support roles: can they distinguish structural causal mechanisms from surface-level correlations when the context changes? To address this, we introduce EconCausal, a large-scale benchmark comprising 10,490 context-annotated causal triplets extracted from 2,595 high-quality empirical studies published in top-tier economics and finance journals. Through a rigorous four-stage pipeline combining multi-run consensus, context refinement, and multi-critic filtering, we ensure each claim is grounded in peer-reviewed research with explicit identification strategies. Our evaluation reveals critical limitations in current LLMs' context-dependent reasoning. While top models achieve approximately 88 percent accuracy in fixed, explicit contexts, performance drops sharply under context shifts, with a 32.6 percentage point decline, and falls to 37 percent when misinformation is introduced. Furthermore, models exhibit severe over-commitment in ambiguous cases and struggle to recognize null effects, achieving only 9.5 percent accuracy, exposing a fundamental gap between pattern matching and genuine causal reasoning. These findings underscore substantial risks for high-stakes economic decision-making, where the cost of misinterpreting causality is high. The dataset and benchmark are publicly available at https://github.com/econaikaist/econcausal-benchmark.
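To make the evaluation setup concrete, the following is a minimal sketch of how a context-annotated causal triplet might be represented and how accuracy could be broken out by effect direction, so that weak performance on null effects is not hidden by an aggregate score. The field names, the three-way label set, and the `accuracy_by_sign` helper are illustrative assumptions, not the schema or scoring code of the released benchmark. The toy records are placeholders, not entries from the dataset.

```python
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical label set; the released dataset may use different names.
SIGNS = ("positive", "negative", "null")


@dataclass(frozen=True)
class CausalTriplet:
    """Assumed shape of one context-annotated causal claim."""
    treatment: str
    outcome: str
    sign: str      # gold effect direction, one of SIGNS
    context: str   # institutional / environmental setting of the study


def accuracy_by_sign(triplets, predictions):
    """Return {sign: accuracy} for each gold sign that occurs in `triplets`."""
    if len(triplets) != len(predictions):
        raise ValueError("triplets and predictions must have the same length")
    correct = defaultdict(int)
    total = defaultdict(int)
    for triplet, predicted in zip(triplets, predictions):
        total[triplet.sign] += 1
        correct[triplet.sign] += int(predicted == triplet.sign)
    return {sign: correct[sign] / total[sign] for sign in total}


# Placeholder records: the same treatment-outcome pair carries opposite
# signs in two contexts, and a third pair has a null effect.
toy = [
    CausalTriplet("policy A", "outcome B", "positive", "context 1"),
    CausalTriplet("policy A", "outcome B", "negative", "context 2"),
    CausalTriplet("policy C", "outcome D", "null", "context 3"),
]

# A context-blind predictor that always answers "positive".
preds = ["positive", "positive", "positive"]
scores = accuracy_by_sign(toy, preds)
print(scores)
```

In this toy case the context-blind predictor is right only where the gold sign happens to be positive, which is the kind of failure a per-sign breakdown is meant to surface.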
