Many reasoning tasks require models to reason over input context, from document-grounded question answering to rule-based deduction. Chain-of-Thought (CoT) prompting produces traces that appear transparent, yet individual steps can silently deviate from the source evidence, even when the final answer is correct. Existing methods detect hallucinations at the response level but fail to identify where in the chain a failure occurs or what type it is. We introduce GRACE, the first human-annotated step-level faithfulness benchmark with a data-driven error taxonomy for context-grounded textual reasoning. GRACE covers CoT traces from 10 models across 4 source datasets, with each step annotated for faithfulness, error category, and natural language explanation. A data-driven taxonomy, discovered bottom-up via unsupervised clustering, organizes failures into two tracks: GRACE-Inference (deductive errors) and GRACE-Grounding (factual grounding errors), with four categories each. The evaluation set is human-annotated and challenging by design. Our experiments reveal substantial headroom for current models. In addition, integrating step-level faithfulness signals into reinforcement learning pipelines improves both downstream accuracy and reasoning reliability.
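To make the step-level annotation concrete, the following is a minimal sketch of how one annotated CoT trace might be represented and queried. It is an illustration under stated assumptions, not the benchmark's actual schema: the class and field names (`StepAnnotation`, `faithful`, `category`), the helper functions, the placeholder category label, and the toy trace are all hypothetical. Only the two track names come from the text above.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class Track(Enum):
    # The two tracks named in the abstract.
    INFERENCE = "GRACE-Inference"  # deductive errors
    GROUNDING = "GRACE-Grounding"  # factual grounding errors


@dataclass
class StepAnnotation:
    # Hypothetical record layout: the field names are assumptions,
    # chosen to mirror "faithfulness, error category, explanation".
    text: str
    faithful: bool
    track: Optional[Track] = None      # set only for unfaithful steps
    category: Optional[str] = None     # placeholder label, not a real GRACE category
    explanation: Optional[str] = None  # natural language explanation


def first_unfaithful_step(steps: List[StepAnnotation]) -> Optional[int]:
    """Return the index of the first unfaithful step, or None if all are faithful."""
    for index, step in enumerate(steps):
        if not step.faithful:
            return index
    return None


def step_faithfulness_rate(steps: List[StepAnnotation]) -> float:
    """Fraction of steps marked faithful (0.0 for an empty trace)."""
    if not steps:
        return 0.0
    return sum(step.faithful for step in steps) / len(steps)


# Toy trace (invented for illustration): the final answer is correct,
# but the middle step asserts something the context never states.
trace = [
    StepAnnotation("The context says the meeting is on Tuesday.", True),
    StepAnnotation(
        "The context says the meeting is in Room 4.",
        False,
        track=Track.GROUNDING,
        category="placeholder-category",
        explanation="The context does not mention a room.",
    ),
    StepAnnotation("So the meeting is on Tuesday.", True),
]

print(first_unfaithful_step(trace))
print(round(step_faithfulness_rate(trace), 3))
```

A response-level check would accept this trace because its conclusion is right. The per-step records are what allow the unsupported middle step to be located and assigned to a track.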