LLM failures in causal reasoning, including sycophancy, rung collapse, and miscalibrated refusal, are well-documented, yet progress on remediation is slow because no benchmark enables systematic diagnosis. We introduce CausalT5K, a diagnostic benchmark of over 5,000 cases across 10 domains that tests three critical capabilities: (1) detecting rung collapse, where models answer interventional queries with associational evidence; (2) resisting sycophantic drift under adversarial pressure; and (3) generating Wise Refusals that specify missing information when evidence is underdetermined. Unlike synthetic benchmarks, CausalT5K embeds causal traps in realistic narratives and decomposes performance into Utility (sensitivity) and Safety (specificity), revealing failure modes invisible to aggregate accuracy. Developed through a rigorous human-machine collaborative pipeline involving 40 domain experts, iterative cross-validation cycles, and composite verification via rule-based, LLM, and human scoring, CausalT5K implements Pearl's Ladder of Causation as research infrastructure. Preliminary experiments reveal a Four-Quadrant Control Landscape where static audit policies universally fail, a finding that demonstrates CausalT5K's value for advancing trustworthy reasoning systems. Repository: https://github.com/genglongling/CausalT5kBench