Large language models increasingly produce fluent causal explanations, yet they often fail in ways that aggregate accuracy cannot diagnose: confusing association with intervention, abandoning correct judgments under pressure, over-refusing valid claims, or answering when the evidence underdetermines the conclusion. We introduce CTK, a growing diagnostic benchmark that currently comprises 5,147 cases across 10 domains and all three rungs of Pearl's Ladder of Causation. Unlike benchmarks that score only correctness, CTK reveals why a model failed by annotating each case with causal rung, trap type, pressure sensitivity, refusal quality, and Utility-Safety tradeoffs. Its Sheep/Wolf taxonomy separates valid causal designs from inferential traps; paired neutral/pressure variants measure sycophantic drift through the Bad Flip Rate; and Wise Refusal fields test whether a model identifies the missing information it needs before endorsing a claim. CTK exposes failure modes that aggregate accuracy hides: the Skepticism Trap, Rung Collapse under scaling, pressure-induced drift, Detection-Correction gaps, and counterfactual error modes. Rather than prescribing a correction method, CTK provides the diagnostic substrate for studying causal-reasoning failure profiles.
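To make the paired neutral/pressure measurement concrete, the following is a minimal sketch of how a Bad Flip Rate could be computed. The abstract does not define the metric, so the definition used here is an assumption: the fraction of cases answered correctly under the neutral variant that become incorrect under the pressure variant. The `PairedResult` record and its field names are hypothetical and are not taken from CTK's schema.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class PairedResult:
    """Hypothetical record of one case scored under both prompt variants."""
    case_id: str
    neutral_correct: bool   # model was correct on the neutral variant
    pressure_correct: bool  # model was correct on the pressure variant


def bad_flip_rate(results: Iterable[PairedResult]) -> Optional[float]:
    """Assumed definition: among cases correct under the neutral variant,
    the share that turn incorrect under the pressure variant.

    Returns None when no case was correct under the neutral variant,
    because the rate is undefined in that situation.
    """
    eligible = [r for r in results if r.neutral_correct]
    if not eligible:
        return None
    flips = sum(1 for r in eligible if not r.pressure_correct)
    return flips / len(eligible)


# Illustrative data only; these are not CTK results.
results = [
    PairedResult("a", neutral_correct=True, pressure_correct=True),
    PairedResult("b", neutral_correct=True, pressure_correct=False),
    PairedResult("c", neutral_correct=False, pressure_correct=False),
    PairedResult("d", neutral_correct=True, pressure_correct=False),
]
rate = bad_flip_rate(results)
```

Conditioning on neutral-variant correctness keeps the metric focused on drift rather than baseline error: case "c" was already wrong before any pressure was applied, so it is excluded from the denominator. A benchmark could reasonably normalize over all pairs instead, which would change the reported value.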