Recent progress in large language models (LLMs) such as GPT-4 and PaLM-2 has brought significant advances in solving math reasoning problems. In particular, OpenAI's latest version of GPT-4, known as GPT-4 Code Interpreter, shows remarkable performance on challenging math datasets. In this paper, we explore the effect of code on enhancing LLMs' reasoning capability by introducing different constraints on the \textit{Code Usage Frequency} of GPT-4 Code Interpreter. We find that its success can be largely attributed to its powerful skills in generating and executing code, evaluating the output of code execution, and rectifying its solution when it receives unreasonable outputs. Based on this insight, we propose a novel and effective prompting method, explicit \uline{c}ode-based \uline{s}elf-\uline{v}erification~(CSV), to further boost the mathematical reasoning potential of GPT-4 Code Interpreter. This method employs a zero-shot prompt on GPT-4 Code Interpreter to encourage it to use code to self-verify its answers. When the verification state registers as ``False'', the model automatically amends its solution, much as a person corrects errors during a mathematics examination. Furthermore, we recognize that the states of the verification result indicate the confidence of a solution, which can improve the effectiveness of majority voting. With GPT-4 Code Interpreter and CSV, we achieve an impressive zero-shot accuracy on the MATH dataset \textbf{(53.9\% $\to$ 84.3\%)}.
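The idea that verification states can serve as confidence signals in majority voting can be illustrated with a minimal sketch. The abstract does not specify how the states are weighted, so the state names other than ``False'', the weight values, and the function name \texttt{weighted\_majority\_vote} below are all assumptions made for illustration, not the method's actual settings.

```python
from collections import defaultdict

# Hypothetical weights per verification state (assumed values; the
# abstract only says that states indicate confidence in a solution).
STATE_WEIGHTS = {"True": 1.0, "Uncertain": 0.5, "False": 0.2}


def weighted_majority_vote(samples, weights=STATE_WEIGHTS):
    """Pick the answer with the largest total verification weight.

    `samples` is a list of (answer, verification_state) pairs, one per
    sampled solution path.
    """
    scores = defaultdict(float)
    for answer, state in samples:
        # Each vote counts in proportion to the confidence of its state.
        scores[answer] += weights[state]
    return max(scores, key=scores.get)


# Three paths agree on "12" but all failed verification; two paths give
# "15", one verified and one uncertain.
samples = [
    ("12", "False"),
    ("12", "False"),
    ("12", "False"),
    ("15", "True"),
    ("15", "Uncertain"),
]
print(weighted_majority_vote(samples))
```

In this toy input, plain majority voting would select ``12'' (three votes against two), whereas the weighted vote favors ``15'' because its supporting solutions carry higher-confidence verification states.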