Large language models show improved downstream task performance when prompted to generate step-by-step reasoning to justify their final answers. These reasoning steps greatly improve model interpretability and verification, but objectively studying their correctness (independent of the final answer) is difficult without reliable methods for automatic evaluation. We simply do not know how often the stated reasoning steps actually support the final end-task predictions. In this work, we present ROSCOE, a suite of interpretable, unsupervised automatic scores that improve and extend previous text generation evaluation metrics. To evaluate ROSCOE against baseline metrics, we design a typology of reasoning errors and collect synthetic and human evaluation scores on commonly used reasoning datasets. In contrast with existing metrics, ROSCOE can measure semantic consistency, logicality, informativeness, fluency, and factuality, among other traits, by leveraging properties of step-by-step rationales. We empirically verify the strength of our metrics on five human-annotated and six programmatically perturbed diagnostic datasets covering a diverse set of tasks that require reasoning skills, and show that ROSCOE can consistently outperform baseline metrics.