Chain-of-Thought (CoT) prompting has significantly advanced the reasoning capabilities of large language models (LLMs). Previous studies have developed various extensions of CoT, focusing primarily on end-task performance. Other work has assessed the quality of the reasoning chains that CoT produces. This raises an intriguing question: can the accuracy of an LLM's output be predicted by scrutinizing the reasoning chains it generates? To answer this question, we introduce R2PE, a benchmark designed specifically to explore the relationship between reasoning chains and performance on various reasoning tasks spanning five domains. The benchmark measures how well the falsehood of an LLM's final output can be detected from its reasoning steps. To make full use of the information in multiple reasoning chains, we propose the process discernibility score (PDS) framework, which outperforms the answer-checking baseline by a large margin. Concretely, PDS yields an average 5.1% increase in F1 score across all 45 subsets of R2PE. We further demonstrate the efficacy of PDS in improving open-domain QA accuracy. Data and code are available at https://github.com/XinXU-USTC/R2PE.
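To make the evaluation setting concrete, the following is a minimal sketch of how falsehood detection could be scored with F1. It is not the R2PE implementation. The abstract does not define the answer-checking baseline, so this sketch assumes a simple stand-in: the final answer is flagged as likely false when too few sampled answers agree with the majority answer. The names `answer_agreement`, `predict_false`, and `f1_score`, the threshold of 0.6, and the toy data are all hypothetical.

```python
from collections import Counter


def answer_agreement(answers):
    """Fraction of sampled answers that match the most common answer."""
    counts = Counter(answers)
    return counts.most_common(1)[0][1] / len(answers)


def predict_false(answers, threshold=0.6):
    """Flag the final answer as likely false when agreement is low.

    Assumption: agreement among sampled answers stands in for the
    answer-checking baseline; the threshold value is illustrative only.
    """
    return answer_agreement(answers) < threshold


def f1_score(y_true, y_pred):
    """F1 for the positive class, where positive means 'final answer is false'."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy data: five sampled answers per question (hypothetical).
samples = [
    ["12", "12", "12", "15", "12"],  # agreement 0.8
    ["7", "9", "7", "8", "3"],       # agreement 0.4
    ["a", "b", "c", "a", "b"],       # agreement 0.4
]
# Ground truth: whether the final answer was actually false (hypothetical).
is_false = [False, True, False]

predictions = [predict_false(s) for s in samples]
score = f1_score(is_false, predictions)
```

A method that also reads the reasoning steps, as PDS is described as doing, would replace `predict_false` with a score that combines information from the chains themselves. The F1 computation would stay the same.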