Ice Cream Doesn't Cause Drowning: Benchmarking LLMs Against Statistical Pitfalls in Causal Inference

Reliable causal inference is essential for decision-making in high-stakes areas such as medicine, economics, and public policy. However, it remains unclear whether large language models (LLMs) can perform rigorous and trustworthy statistical causal inference. Current benchmarks usually involve simplified tasks, such as asking LLMs only to identify semantic causal relationships or to draw conclusions directly from raw data. As a result, models may overlook important statistical pitfalls, such as Simpson's paradox or selection bias, which limits the applicability of LLMs in the real world. To address these limitations, we propose CausalPitfalls, a comprehensive benchmark designed to rigorously evaluate the capability of LLMs to overcome common causal inference pitfalls. Our benchmark features structured challenges across multiple difficulty levels, each paired with grading rubrics. This design allows us to quantitatively measure both the causal reasoning capabilities of LLMs and the reliability of their responses. We evaluate models using two protocols: (1) direct prompting, which assesses intrinsic causal reasoning, and (2) code-assisted prompting, where models generate executable code for explicit statistical analysis. Additionally, we validate the judge that scores responses against these rubrics by comparing its scores with assessments from human experts. Our results reveal significant limitations in current LLMs when performing statistical causal inference. The CausalPitfalls benchmark provides essential guidance and quantitative metrics to advance the development of trustworthy causal reasoning systems.
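To make the Simpson's paradox pitfall concrete, the following minimal sketch shows the kind of explicit stratified analysis that code-assisted prompting could produce. It is not taken from the CausalPitfalls benchmark: the counts are illustrative (patterned on the widely used kidney-stone teaching example), and the names COUNTS, pooled_rate, stratum_rate, and adjusted_rate are hypothetical helpers introduced here.

```python
from fractions import Fraction

# Illustrative (successes, trials) counts per treatment and stone-size stratum.
# These numbers are an assumption for demonstration, not benchmark data.
COUNTS = {
    "A": {"small": (81, 87), "large": (192, 263)},
    "B": {"small": (234, 270), "large": (55, 80)},
}
STRATA = ("small", "large")


def pooled_rate(treatment):
    """Success rate ignoring the stratum (the naive, confounded comparison)."""
    successes = sum(COUNTS[treatment][s][0] for s in STRATA)
    trials = sum(COUNTS[treatment][s][1] for s in STRATA)
    return Fraction(successes, trials)


def stratum_rate(treatment, stratum):
    """Success rate within a single stratum."""
    successes, trials = COUNTS[treatment][stratum]
    return Fraction(successes, trials)


def adjusted_rate(treatment):
    """Stratum rates reweighted by the overall stratum distribution.

    This is standardization over the confounder (stone size), assuming
    stone size is the only confounder of treatment and outcome.
    """
    stratum_totals = {s: sum(COUNTS[t][s][1] for t in COUNTS) for s in STRATA}
    n = sum(stratum_totals.values())
    return sum(
        stratum_rate(treatment, s) * Fraction(stratum_totals[s], n)
        for s in STRATA
    )


if __name__ == "__main__":
    for t in COUNTS:
        print(
            f"treatment {t}: pooled={float(pooled_rate(t)):.3f}, "
            f"small={float(stratum_rate(t, 'small')):.3f}, "
            f"large={float(stratum_rate(t, 'large')):.3f}, "
            f"adjusted={float(adjusted_rate(t)):.3f}"
        )
```

In this sketch the pooled comparison favors treatment B, while every stratum and the adjusted comparison favor treatment A. A model that reads only the pooled rates would draw the wrong causal conclusion, which is the type of error the benchmark is described as probing.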
