Multi-step reasoning ability is fundamental to many natural language tasks, yet it is unclear what constitutes a good reasoning chain and how to evaluate one. Most existing methods focus solely on whether the reasoning chain leads to the correct conclusion, but this answer-oriented view may confound reasoning quality with spurious shortcuts for predicting the answer. To bridge this gap, we evaluate reasoning chains by viewing them as informal proofs that derive the final answer. Specifically, we propose ReCEval (Reasoning Chain Evaluation), a framework that evaluates reasoning chains via two key properties: (1) correctness, i.e., each step makes a valid inference based on information contained within the step, preceding steps, and input context, and (2) informativeness, i.e., each step provides new information that is helpful towards deriving the generated answer. We evaluate these properties by developing metrics using natural language inference models and V-information. On multiple datasets, we show that ReCEval effectively identifies various error types and yields notable improvements compared to prior methods. We analyze the impact of step boundaries and previous steps on evaluating correctness, and demonstrate that our informativeness metric captures the expected flow of information in high-quality reasoning chains. Finally, we show that scoring reasoning chains based on ReCEval improves downstream task performance. Our code is publicly available at: https://github.com/archiki/ReCEval
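To make the two properties concrete, the following is a minimal sketch of how step-level signals could be turned into chain-level scores. It is not the released ReCEval implementation. It assumes that an NLI model has already produced entailment and contradiction probabilities for each step, and that a language model has produced log-probabilities of the answer with and without that step (a log-probability gain in the spirit of pointwise V-information). The names `StepSignals`, `step_correctness`, `step_informativeness`, and `chain_scores`, the entailment-minus-contradiction score, and the use of the minimum to aggregate over steps are all illustrative assumptions.

```python
import math
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class StepSignals:
    """Hypothetical per-step inputs, assumed to come from external models."""
    p_entail: float             # NLI: P(step is entailed by prior steps + context)
    p_contradict: float         # NLI: P(step contradicts prior steps + context)
    logp_answer_with: float     # log p(answer | context, steps up to and including this one)
    logp_answer_without: float  # log p(answer | context, preceding steps only)


def step_correctness(s: StepSignals) -> float:
    # Assumed scoring rule: entailment minus contradiction, in [-1, 1].
    return s.p_entail - s.p_contradict


def step_informativeness(s: StepSignals) -> float:
    # Log-probability gain for the answer when the step is added.
    # Positive: the step helps derive the answer; negative: it hurts.
    return s.logp_answer_with - s.logp_answer_without


def chain_scores(steps: List[StepSignals]) -> Dict[str, float]:
    # Assumed aggregation: a chain is only as good as its weakest step.
    if not steps:
        raise ValueError("a reasoning chain needs at least one step")
    return {
        "correctness": min(step_correctness(s) for s in steps),
        "informativeness": min(step_informativeness(s) for s in steps),
    }


# Made-up numbers for illustration only (not results from the paper).
good_chain = [
    StepSignals(0.90, 0.05, math.log(0.4), math.log(0.2)),
    StepSignals(0.85, 0.05, math.log(0.8), math.log(0.4)),
]
bad_chain = [
    StepSignals(0.90, 0.05, math.log(0.4), math.log(0.2)),
    StepSignals(0.10, 0.70, math.log(0.15), math.log(0.4)),  # invalid, unhelpful step
]

if __name__ == "__main__":
    print("good:", chain_scores(good_chain))
    print("bad: ", chain_scores(bad_chain))
```

Under these assumptions, a single invalid or uninformative step lowers the score of the whole chain, which matches the view of a reasoning chain as an informal proof in which every step must hold.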