Prompting language models to provide step-by-step answers (e.g., "Chain-of-Thought") is a prominent approach for complex reasoning tasks, where more accurate reasoning chains typically improve downstream task performance. Recent literature discusses automatic methods for verifying reasoning steps in order to evaluate and improve their correctness. However, no fine-grained, step-level datasets are available to enable thorough evaluation of such verification methods, which hinders progress in this direction. We introduce Reveal: Reasoning Verification Evaluation, a new dataset for benchmarking automatic verifiers of complex Chain-of-Thought reasoning in open-domain question answering settings. Reveal includes comprehensive labels for the relevance, attribution to evidence passages, and logical correctness of each reasoning step in a language model's answer, across a wide variety of datasets and state-of-the-art language models.