The recent progress in large language models (LLMs), especially the invention of chain-of-thought prompting, has made it possible to automatically answer questions by stepwise reasoning. However, when faced with more complicated problems that require non-linear thinking, even the strongest LLMs make mistakes. To address this, we explore whether LLMs are able to recognize errors in their own step-by-step reasoning, without resorting to external resources. To this end, we propose SelfCheck, a general-purpose zero-shot verification schema for recognizing such errors. We then use the results of these checks to improve question-answering performance by conducting weighted voting on multiple solutions to the question. We test SelfCheck on three datasets (GSM8K, MathQA, and MATH) and find that it successfully recognizes errors and, in turn, increases final answer accuracies.
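The weighted-voting step can be illustrated with a minimal sketch. It assumes each candidate solution has already been assigned a non-negative confidence score by the checker; the function name `weighted_vote`, the `(answer, confidence)` pair format, and the simple sum-of-confidences rule are illustrative assumptions, not the exact scheme used by SelfCheck.

```python
from collections import defaultdict


def weighted_vote(solutions):
    """Return the answer with the largest total confidence.

    `solutions` is an iterable of (answer, confidence) pairs, where
    confidence is a non-negative score for the whole solution
    (hypothetical format, assumed for illustration).
    """
    totals = defaultdict(float)
    for answer, confidence in solutions:
        if confidence < 0:
            raise ValueError("confidence must be non-negative")
        totals[answer] += confidence
    if not totals:
        raise ValueError("at least one solution is required")
    # Pick the answer whose solutions accumulate the most confidence.
    return max(totals, key=totals.get)


# Two low-confidence solutions agree on "41"; one high-confidence
# solution gives "42". Plain majority voting would choose "41".
candidates = [("41", 0.2), ("41", 0.3), ("42", 0.9)]
best = weighted_vote(candidates)
```

With these made-up scores, the single high-confidence solution outweighs the two low-confidence ones, which is the behavior that distinguishes confidence-weighted voting from plain majority voting.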