As large language models (LLMs) are increasingly deployed in complex, high-stakes decision-making scenarios, it becomes imperative to ground their reasoning in causality rather than spurious correlations. However, strong performance on traditional reasoning benchmarks does not guarantee genuine causal reasoning ability, as high accuracy may still arise from memorizing semantic patterns rather than analyzing the underlying causal structures. To bridge this critical gap, we propose CausalFlip, a new causal reasoning benchmark designed to encourage the development of LLM paradigms and training algorithms that ground reasoning in causality rather than semantic correlation. CausalFlip consists of causal judgment questions built over event triples that can form different confounder, chain, and collider relations. For each event triple, we construct pairs of semantically similar questions that reuse the same events but yield opposite causal answers, so that models relying heavily on semantic matching are systematically driven toward incorrect predictions. To further probe models' reliance on semantic patterns, we introduce a noisy-prefix evaluation that prepends causally irrelevant text to intermediate causal reasoning steps without altering the underlying causal relations or the logic of the reasoning process. We evaluate LLMs under multiple training paradigms, including answer-only training, explicit Chain-of-Thought (CoT) supervision, and a proposed internalized causal reasoning approach that aims to mitigate explicit reliance on correlation during reasoning. Our results show that explicit CoT can still be misled by spurious semantic correlations, whereas internalizing the reasoning steps yields substantially better causal grounding, suggesting that eliciting the latent causal reasoning capabilities of base LLMs is a promising direction.
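To make the flipped-pair construction concrete, the following is a minimal sketch, not the authors' released code, of how two semantically similar questions over the same event triple could carry opposite gold answers; the structure names, question template, and example events are illustrative assumptions.

```python
# Hypothetical sketch of a CausalFlip-style flipped question pair.
# A chain (A -> B -> C) makes "does A cause C?" true, while a collider
# (A -> B <- C) over the same events makes it false, so a model that
# matches surface semantics alone cannot answer both correctly.
from dataclasses import dataclass


@dataclass
class CausalQuestion:
    events: tuple      # (A, B, C) event descriptions, reused verbatim in both questions
    structure: str     # "chain" or "collider" (assumed labels)
    premise: str       # natural-language statement of the causal structure
    query: str         # the causal judgment question
    answer: bool       # gold causal answer


def flipped_pair(a: str, b: str, c: str):
    """Build two questions over the same events (a, b, c) with opposite answers."""
    query = f"Does '{a}' causally influence '{c}'?"
    chain = CausalQuestion(
        events=(a, b, c), structure="chain",
        premise=f"'{a}' leads to '{b}', and '{b}' leads to '{c}'.",
        query=query, answer=True,    # A -> B -> C: A is a cause of C
    )
    collider = CausalQuestion(
        events=(a, b, c), structure="collider",
        premise=f"'{a}' leads to '{b}', and '{c}' also leads to '{b}'.",
        query=query, answer=False,   # A -> B <- C: A does not cause C
    )
    return chain, collider


if __name__ == "__main__":
    q_yes, q_no = flipped_pair("heavy rainfall", "flooded streets", "traffic delays")
    for q in (q_yes, q_no):
        print(q.premise, q.query, "->", q.answer)
```

The noisy-prefix evaluation described above would, under the same assumptions, simply prepend causally irrelevant text to each premise before querying the model, leaving the structure and gold answer unchanged.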