Large language models reach 50 to 70% accuracy on causal reasoning benchmarks such as CLadder, but it is unclear whether this reflects structural reasoning or lexical pattern matching. We introduce Caliper, a controlled perturbation that replaces semantic variable names with placeholder tokens while preserving the causal graph and probabilistic specification of each question. Across nine instruction-tuned LLMs from 3.8B to 671B parameters and three causal reasoning benchmarks, lexical anonymization yields robust accuracy drops (original-prompt accuracy, P0, minus anonymized-prompt accuracy, P1) of +7.6, +27.0, and +11.1 pp on a local 3.8B-14B model set, rising to +29.6 and +18.0 pp on CRASS and e-CARE across nine frontier models spanning the 2024-2026 generations. Of 40 engaged model-by-benchmark cells, 39 show a positive gap, and the gap shrinks by a factor of 17 on CLadder's pseudoword subset. Structured scaffolding and few-shot in-context learning each narrow the gap, but mainly by lowering P0 accuracy on smaller models rather than by recovering P1 accuracy. Current instruction-tuned LLMs, evaluated zero-shot, show little evidence of structural causal reasoning once lexical anchors are removed.
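To make the perturbation and the reported gap concrete, the following is a minimal sketch rather than the Caliper implementation. It assumes placeholders of the form `X1`, `X2`, ... (the abstract does not specify the token format), that the variable names of each question are known in advance, and that P0 denotes accuracy on the original prompts and P1 accuracy on the anonymized ones. The function names `anonymize` and `gap_pp` are hypothetical.

```python
import re


def anonymize(question: str, variables: list[str]) -> str:
    """Replace each semantic variable name with a placeholder token.

    Assumption: placeholders are X1, X2, ... in the order the variables
    are listed. Only the surface names change; the sentence structure
    that encodes the causal graph is left untouched.
    """
    mapping = {name.lower(): f"X{i}" for i, name in enumerate(variables, start=1)}
    # Try longer names first so a name containing another is not split.
    names = sorted(mapping, key=len, reverse=True)
    pattern = re.compile(
        "|".join(rf"\b{re.escape(n)}\b" for n in names),
        flags=re.IGNORECASE,
    )
    return pattern.sub(lambda m: mapping[m.group(0).lower()], question)


def gap_pp(p0_correct: list[bool], p1_correct: list[bool]) -> float:
    """Accuracy gap in percentage points: P0 (original) minus P1 (anonymized)."""
    def accuracy(flags: list[bool]) -> float:
        return 100.0 * sum(flags) / len(flags)

    return accuracy(p0_correct) - accuracy(p1_correct)


question = (
    "Smoking raises the chance of tar, and tar raises the chance of cancer. "
    "Does smoking affect cancer?"
)
print(anonymize(question, ["smoking", "tar", "cancer"]))
# X1 raises the chance of X2, and X2 raises the chance of X3. Does X1 affect X3?

# Toy per-question correctness flags, purely illustrative (not results from the paper).
print(gap_pp([True, True, True, False], [True, False, False, False]))
# 50.0
```

Under this convention a positive gap means the model answered more questions correctly when the semantic names were present, which is the direction reported for 39 of the 40 cells.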