Recently, with chain-of-thought (CoT) prompting, large language models (LLMs) such as GPT-3 have shown strong reasoning ability on several natural language processing tasks, including arithmetic, commonsense, and logical reasoning. However, LLMs with CoT require multi-step prompting and multi-token prediction, which makes them highly sensitive to individual mistakes and vulnerable to error accumulation. LLMs therefore need the ability to verify their own answers. People do something similar: after reaching a conclusion in a reasoning or decision task, they often check it by re-verifying the steps to avoid mistakes. In this paper, we propose and demonstrate that LLMs have a similar self-verification ability. We take the conclusion obtained by CoT as one of the conditions for solving the original problem. By masking each original condition in turn and predicting its value, we calculate an explainable answer verification score based on whether the re-predicted conditions are correct. Experimental results demonstrate that the proposed method improves reasoning performance on various arithmetic, commonsense, and logical reasoning datasets.
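The masking-and-re-prediction idea described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the names `verification_score`, `select_answer`, and `toy_predict` are hypothetical, the conditions are reduced to plain integers, and the LLM call is replaced by a toy function that solves the problem "a + b = ?" backwards. The score here is simply the number of masked conditions that are re-predicted correctly, which is one assumed way to realize the scoring the abstract mentions.

```python
from typing import Callable, List, Optional

# Hypothetical predictor interface: receives the condition values with one
# entry masked (None) plus a candidate conclusion, and returns its guess for
# the masked value. In a real system this would be an LLM call.
Predictor = Callable[[List[Optional[int]], int], int]


def verification_score(values: List[int], candidate: int, predict: Predictor) -> int:
    """Mask each condition in turn and count correct re-predictions."""
    score = 0
    for i, true_value in enumerate(values):
        masked: List[Optional[int]] = list(values)
        masked[i] = None  # hide one original condition
        if predict(masked, candidate) == true_value:
            score += 1
    return score


def select_answer(values: List[int], candidates: List[int], predict: Predictor) -> int:
    """Pick the candidate conclusion with the highest verification score."""
    return max(candidates, key=lambda c: verification_score(values, c, predict))


def toy_predict(masked: List[Optional[int]], candidate: int) -> int:
    """Toy stand-in for the model on 'a + b = ?': recover the masked addend."""
    return candidate - sum(v for v in masked if v is not None)


# Conditions a = 3 and b = 4; two candidate conclusions, e.g. from CoT sampling.
best = select_answer([3, 4], [8, 7], toy_predict)
```

With the conditions 3 and 4, the candidate 7 lets both masked values be recovered, while the candidate 8 recovers neither, so the sketch selects 7. The per-condition counts are what make the score inspectable: each point corresponds to one condition that remained consistent with the candidate conclusion.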