Multi-step reasoning ability is fundamental to many natural language tasks, yet it is unclear what constitutes a good reasoning chain and how to evaluate such chains. Most existing methods focus solely on whether the reasoning chain leads to the correct conclusion, but this answer-oriented view may confound the quality of reasoning with spurious shortcuts that happen to predict the answer. To bridge this gap, we evaluate reasoning chains by viewing them as informal proofs that derive the final answer. Specifically, we propose ReCEval (Reasoning Chain Evaluation), a framework that evaluates reasoning chains through two key properties: (1) correctness, i.e., each step makes a valid inference based on the information contained within the step, preceding steps, and input context, and (2) informativeness, i.e., each step provides new information that is helpful towards deriving the generated answer. We implement ReCEval using natural language inference models and information-theoretic measures. On multiple datasets, ReCEval is highly effective in identifying different types of errors, yielding notable improvements over prior methods. We demonstrate that our informativeness metric captures the expected flow of information in high-quality reasoning chains, and we analyze the impact of previous steps on evaluating correctness and informativeness. Finally, we show that scoring reasoning chains based on ReCEval can improve downstream performance on reasoning tasks. Our code is publicly available at: https://github.com/archiki/ReCEval
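The two properties described in the abstract can be illustrated with a small sketch. This is not the released ReCEval implementation: `entail_prob` and `answer_logprob` are hypothetical stand-ins for a natural language inference model and an answer-likelihood scorer, the word-overlap "toy" scorers exist only so the example runs without any model, and aggregating step scores by taking the minimum over the chain is an assumption made here. The sketch also covers only the part of correctness that depends on preceding steps and input context, not the check within a single step.

```python
import math
import re
from typing import Callable, Sequence, Tuple

# Hypothetical scorer signatures (assumptions, not the ReCEval API):
#   entail_prob(premise, hypothesis) -> probability in [0, 1] of entailment
#   answer_logprob(context, answer)  -> log-probability of the answer given context
EntailFn = Callable[[str, str], float]
LogProbFn = Callable[[str, str], float]


def step_correctness(context: str, steps: Sequence[str], i: int,
                     entail_prob: EntailFn) -> float:
    """Entailment score of step i given the input context and preceding steps."""
    premise = " ".join([context, *steps[:i]])
    return entail_prob(premise, steps[i])


def step_informativeness(context: str, steps: Sequence[str], i: int,
                         answer: str, answer_logprob: LogProbFn) -> float:
    """Gain in answer log-probability obtained by adding step i."""
    before = " ".join([context, *steps[:i]])
    after = " ".join([context, *steps[:i + 1]])
    return answer_logprob(after, answer) - answer_logprob(before, answer)


def score_chain(context: str, steps: Sequence[str], answer: str,
                entail_prob: EntailFn,
                answer_logprob: LogProbFn) -> Tuple[float, float]:
    """Return (correctness, informativeness), each the minimum over all steps.

    Taking the minimum means one bad step lowers the score of the whole chain
    (an aggregation choice assumed for this sketch).
    """
    if not steps:
        raise ValueError("reasoning chain must contain at least one step")
    indices = range(len(steps))
    correctness = min(step_correctness(context, steps, i, entail_prob)
                      for i in indices)
    informativeness = min(
        step_informativeness(context, steps, i, answer, answer_logprob)
        for i in indices)
    return correctness, informativeness


def _tokens(text: str) -> set:
    """Lowercased alphanumeric tokens, used only by the toy scorers below."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def toy_entail_prob(premise: str, hypothesis: str) -> float:
    """Toy stand-in for an NLI model: share of hypothesis tokens in the premise."""
    hyp = _tokens(hypothesis)
    if not hyp:
        return 0.0
    return len(hyp & _tokens(premise)) / len(hyp)


def toy_answer_logprob(context: str, answer: str) -> float:
    """Toy stand-in for a likelihood model: smoothed answer-token coverage."""
    ans = _tokens(answer)
    hits = len(ans & _tokens(context))
    return math.log((hits + 1) / (len(ans) + 2))


if __name__ == "__main__":
    context = "Tom has 3 apples. Ann has 2 apples."
    chain = ["Tom has 3 apples and Ann has 2 apples.",
             "Together they have 5 apples."]
    print(score_chain(context, chain, "5", toy_entail_prob, toy_answer_logprob))
```

In practice the two toy scorers would be replaced by real models. The scores they produce here carry no semantic meaning and serve only to show how per-step scores are computed from the context and preceding steps and then combined into chain-level scores.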