Large language models (LLMs) are increasingly used as judges of chain-of-thought (CoT) reasoning, yet it remains unclear whether they can reliably assess process faithfulness rather than merely answer plausibility. We introduce C2-Faith, a benchmark built from PRM800K that explicitly decomposes faithfulness into two complementary dimensions: causality (whether each step logically follows from prior context) and coverage (whether essential intermediate inferences are present). Using controlled perturbations, we construct examples with known causal error positions by replacing a single step with a logically inconsistent variant, and with controlled coverage deletions at varying rates, enabling direct measurement against reference labels. We evaluate three frontier LLM judges across three tasks: binary causal detection, causal step localization, and coverage scoring. Our results reveal that judge reliability is highly task-dependent, with no single model dominating across settings. While models often detect that an error exists, they struggle to accurately localize it, indicating a substantial gap between detection and attribution. Moreover, all judges systematically overestimate reasoning completeness, assigning high coverage scores even when substantial portions of intermediate reasoning are missing. These findings expose fundamental limitations of LLM judges in process-level evaluation and highlight the need for more reliable and calibrated methods when using LLMs to assess reasoning quality.
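The two controlled perturbations described above can be illustrated with a minimal sketch. This is not the benchmark's actual construction code: the function names `inject_causal_error` and `delete_steps`, the choice to always keep the final step, and the definition of reference coverage as the fraction of intermediate steps retained are all assumptions made for illustration. The logically inconsistent replacement step is passed in by the caller, since the abstract does not say how it is generated.

```python
import random
from typing import List, Tuple


def inject_causal_error(
    steps: List[str], corrupted_step: str, position: int
) -> Tuple[List[str], int]:
    """Replace exactly one step with a corrupted variant.

    Returns the perturbed chain and the known error position, which
    serves as the reference label for detection and localization.
    """
    if not 0 <= position < len(steps):
        raise IndexError("position is outside the reasoning chain")
    perturbed = list(steps)  # copy so the original chain is untouched
    perturbed[position] = corrupted_step
    return perturbed, position


def delete_steps(
    steps: List[str], rate: float, rng: random.Random
) -> Tuple[List[str], float]:
    """Delete a fraction `rate` of the intermediate steps.

    Assumption: the final step is always kept, so only the
    intermediate reasoning is thinned out. Returns the shortened chain
    and a reference coverage value (fraction of intermediate steps kept).
    """
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be in [0, 1]")
    n_intermediate = len(steps) - 1
    if n_intermediate <= 0:
        return list(steps), 1.0
    k = round(rate * n_intermediate)  # number of steps to drop
    drop = set(rng.sample(range(n_intermediate), k))
    kept = [s for i, s in enumerate(steps) if i not in drop]
    return kept, 1.0 - k / n_intermediate


chain = ["s1", "s2", "s3", "s4", "answer"]
perturbed_chain, error_position = inject_causal_error(chain, "s3 (inconsistent)", 2)
thinned_chain, reference_coverage = delete_steps(chain, 0.5, random.Random(0))
```

Because both the error position and the deletion rate are fixed when an example is built, a judge's output can be compared directly against these reference values.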