With chain-of-thought (CoT) prompting, large language models (LLMs) such as GPT-3 have recently shown strong reasoning ability on natural language processing tasks including arithmetic, commonsense, and logical reasoning. However, CoT requires multi-step prompting and multi-token prediction, which makes it highly sensitive to individual mistakes and vulnerable to error accumulation. LLMs therefore need a way to verify their own answers. Humans do something similar: after reaching a conclusion in a reasoning or decision task, people often re-verify the steps to catch mistakes. In this paper, we propose and show that LLMs have a similar self-verification ability. We take the conclusion obtained by CoT as one of the conditions for solving the original problem. By masking each original condition in turn and predicting its value, we compute an explainable answer verification score based on whether the re-predicted conditions are correct. Experimental results demonstrate that the proposed method improves reasoning performance on various arithmetic, commonsense, and logical reasoning datasets. Our code is publicly available at: https://github.com/WENGSYX/Self-Verification.
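The masking-and-scoring idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation (see the linked repository for that). The names `verification_score`, `select_answer`, and `toy_predictor` are hypothetical, and the predictor here is a simple arithmetic stand-in for what would be an LLM call in practice. The sketch assumes a toy problem of the form a + b = answer, where the conditions are the two known numbers.

```python
from typing import Callable, Optional, Sequence

# Hypothetical interface: receives the conditions with one entry masked (None)
# plus the candidate conclusion, and returns its guess for the masked value.
Predictor = Callable[[Sequence[Optional[int]], int], int]


def verification_score(
    conditions: Sequence[int], candidate: int, predict: Predictor
) -> int:
    """Count how many masked conditions are re-predicted correctly."""
    score = 0
    for i, true_value in enumerate(conditions):
        # Mask exactly one condition, keep the others visible.
        masked = [None if j == i else c for j, c in enumerate(conditions)]
        if predict(masked, candidate) == true_value:
            score += 1
    return score


def select_answer(
    conditions: Sequence[int], candidates: Sequence[int], predict: Predictor
) -> int:
    """Return the candidate with the highest score (first one wins on ties)."""
    return max(
        candidates, key=lambda c: verification_score(conditions, c, predict)
    )


def toy_predictor(masked: Sequence[Optional[int]], candidate: int) -> int:
    """Stand-in for an LLM call: solves a + b = candidate for the masked term."""
    return candidate - sum(v for v in masked if v is not None)


if __name__ == "__main__":
    conditions = [3, 4]   # known facts of the toy problem "3 + 4 = ?"
    candidates = [8, 7]   # e.g. two answers sampled from CoT
    for c in candidates:
        print(c, verification_score(conditions, c, toy_predictor))
    print("selected:", select_answer(conditions, candidates, toy_predictor))
```

With the toy predictor, the correct candidate recovers every masked condition while the wrong one recovers none, so the score separates them. With a real model, the score would depend on how reliably the model re-predicts each masked condition.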