Prompting language models to provide step-by-step answers (e.g., "Chain-of-Thought") is a prominent approach for complex reasoning tasks, where more accurate reasoning chains typically improve downstream task performance. Recent literature discusses automatic methods for verifying reasoning steps in order to evaluate and improve their correctness. However, no fine-grained, step-level datasets are available for thorough evaluation of such verification methods, which hinders progress in this direction. We introduce Reveal: Reasoning Verification Evaluation, a new dataset for benchmarking automatic verifiers of complex Chain-of-Thought reasoning in open-domain question answering settings. Reveal includes comprehensive labels for the relevance, attribution to evidence passages, and logical correctness of each reasoning step in a language model's answer, across a wide variety of datasets and state-of-the-art language models.