Recent progress in large language models (LLMs), especially the introduction of chain-of-thought (CoT) prompting, has made it possible to solve reasoning problems. However, even the strongest LLMs still struggle with more complicated problems that require non-linear thinking and multi-step reasoning. In this work, we explore whether LLMs can recognize their own errors without resorting to external resources. In particular, we investigate whether they can identify individual errors within step-by-step reasoning. To this end, we propose a zero-shot verification scheme to recognize such errors. We then use this scheme to improve question-answering performance by performing weighted voting over different generated answers. We test the method on three math datasets (GSM8K, MathQA, and MATH) and find that it successfully recognizes errors and, in turn, increases final predictive performance.
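To make the weighted-voting idea concrete, the following is a minimal sketch rather than the method as implemented in this work. It assumes each sampled solution has already been reduced to a final answer string and a non-negative confidence score from some verifier; the function name `weighted_vote` and the example scores are hypothetical, and the abstract does not specify how the weights are actually computed.

```python
from collections import defaultdict


def weighted_vote(candidates):
    """Return the answer with the largest total weight.

    `candidates` is a list of (answer, weight) pairs. The weight is an
    assumed verifier confidence for the reasoning chain that produced the
    answer; how it is derived is outside the scope of this sketch.
    """
    if not candidates:
        raise ValueError("need at least one candidate answer")

    totals = defaultdict(float)
    for answer, weight in candidates:
        totals[answer] += weight  # identical answers pool their weights

    # Ties are broken in favor of the answer seen first.
    return max(totals, key=totals.get)


# Hypothetical samples: three low-confidence chains agree on "20",
# one high-confidence chain yields "18".
samples = [("20", 0.2), ("20", 0.3), ("20", 0.1), ("18", 0.9)]
best = weighted_vote(samples)
```

With these illustrative scores, plain majority voting would select "20", whereas the weighted vote selects "18" because the single high-confidence chain outweighs the three low-confidence ones.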