Solving Challenging Math Word Problems Using GPT-4 Code Interpreter with Code-based Self-Verification

Recent progress in large language models (LLMs) such as GPT-4 and PaLM-2 has brought significant advances in mathematical reasoning. In particular, OpenAI's latest version of GPT-4, known as GPT-4 Code Interpreter, shows remarkable performance on challenging math datasets. In this paper, we explore the effect of code on enhancing LLMs' reasoning capability by introducing different constraints on the Code Usage Frequency of GPT-4 Code Interpreter. We find that its success can be largely attributed to its powerful skills in generating and executing code, evaluating the output of code execution, and rectifying its solution when receiving unreasonable outputs. Based on this insight, we propose a novel and effective prompting method, explicit code-based self-verification (CSV), to further boost the mathematical reasoning potential of GPT-4 Code Interpreter. This method employs a zero-shot prompt on GPT-4 Code Interpreter to encourage it to use code to self-verify its answers. When the verification state registers as "False", the model automatically amends its solution, much as a person corrects errors during a mathematics examination. Furthermore, we recognize that the states of the verification result indicate the confidence of a solution, which can improve the effectiveness of majority voting. With GPT-4 Code Interpreter and CSV, we achieve an impressive zero-shot accuracy on the MATH dataset (53.9% → 84.3%).
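The idea that verification states can inform majority voting can be illustrated with a minimal sketch. It assumes each sampled solution carries a final answer and a verification state, and that each state maps to a vote weight. The state labels other than "False", the weight values, and the function name `weighted_majority_vote` are hypothetical choices made for illustration; the abstract does not specify them.

```python
from collections import defaultdict

# Hypothetical vote weights per verification state (assumed values,
# not taken from the paper): verified answers count more.
STATE_WEIGHTS = {"True": 1.0, "Uncertain": 0.5, "False": 0.2}


def weighted_majority_vote(samples, weights=STATE_WEIGHTS):
    """Pick the answer with the highest total weight.

    samples: list of (answer, verification_state) pairs, one per
    sampled solution.
    """
    scores = defaultdict(float)
    for answer, state in samples:
        scores[answer] += weights[state]
    return max(scores, key=scores.get)


# Five sampled solutions: "42" appears most often, but every
# occurrence failed its own code-based verification.
samples = [
    ("42", "False"),
    ("42", "False"),
    ("42", "False"),
    ("36", "True"),
    ("36", "Uncertain"),
]

uniform = {"True": 1.0, "Uncertain": 1.0, "False": 1.0}
print(weighted_majority_vote(samples, uniform))  # plain majority voting
print(weighted_majority_vote(samples))           # verification-weighted voting
```

With uniform weights the sketch reduces to ordinary majority voting and selects "42"; with the assumed weights, the verified answer "36" wins even though it was sampled less often.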

