Outcome-supervised Verifiers for Planning in Mathematical Reasoning

Large language models (LLMs) often struggle to stay accurate across a sequence of intermediate steps in mathematical reasoning, so early errors propagate and undermine the final result. The prevailing way to mitigate this is to use a verifier model to assess the correctness of generated solution candidates, scoring either the complete reasoning path or an incomplete one. Rethinking this approach, we argue that assessing the potential of incomplete reasoning paths could be more advantageous, because it guides generation towards correct final answers and turns the task into a \textit{planning} problem. Our proposed verifier, the Outcome-supervision Value Model (OVM), is trained with outcome supervision, offering an efficient and intuitive method for \textit{planning} that prioritizes steps leading to accurate conclusions over mere per-step correctness. Furthermore, the OVM needs no labor-intensive annotations of step-level correctness, which improves its scalability. Our experiments on two multi-step mathematical reasoning datasets, GSM8K and Game of 24, demonstrate the superior performance of the OVM. Notably, on GSM8K, our \textbf{OVM-7B model achieves state-of-the-art results among LLMs up to 13B parameters}, and it does so without using GPT-4 or code execution. These findings offer a novel perspective on the role of outcome supervision in training verifiers for multi-step reasoning tasks and provide theoretical justification for its advantage in value estimation for planning.
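To make the planning idea concrete, here is a minimal sketch of how a value model over incomplete reasoning paths could steer generation. It assumes a beam-search style loop, which the abstract does not specify, so it should be read as one plausible illustration rather than the authors' procedure. All names (`value_guided_beam_search`, `propose_steps`, `toy_value`, `is_complete`) are hypothetical, and `toy_value` is a hand-written stand-in for a trained OVM on a toy task (pick three numbers from 1 to 3 that sum to a target).

```python
from typing import Callable, List, Tuple

Path = Tuple[int, ...]  # a partial reasoning path; here each "step" is just a number


def value_guided_beam_search(
    question: int,
    propose_steps: Callable[[int, Path], List[int]],
    value_fn: Callable[[int, Path], float],
    is_complete: Callable[[Path], bool],
    beam_size: int = 2,
    max_steps: int = 10,
) -> Path:
    """Keep the partial paths that the value function rates as most promising."""
    beams: List[Path] = [()]
    for _ in range(max_steps):
        candidates: List[Path] = []
        for path in beams:
            if is_complete(path):
                candidates.append(path)  # finished paths are carried over unchanged
            else:
                for step in propose_steps(question, path):
                    candidates.append(path + (step,))
        if not candidates:
            break  # nothing left to expand
        # Rank by estimated chance of reaching a correct final answer,
        # not by whether the latest step looks correct in isolation.
        candidates.sort(key=lambda p: value_fn(question, p), reverse=True)
        beams = candidates[:beam_size]
        if all(is_complete(p) for p in beams):
            break
    return beams[0]


# --- Toy task (an assumption for illustration, not from the paper) ---
NUM_STEPS = 3


def propose_steps(question: int, path: Path) -> List[int]:
    return [1, 2, 3]  # stand-in for sampling next steps from a generator LLM


def is_complete(path: Path) -> bool:
    return len(path) == NUM_STEPS


def toy_value(question: int, path: Path) -> float:
    # Stand-in for a trained value model: 1.0 if the target is still reachable.
    remaining = question - sum(path)
    steps_left = NUM_STEPS - len(path)
    return 1.0 if steps_left <= remaining <= 3 * steps_left else 0.0


best_path = value_guided_beam_search(6, propose_steps, toy_value, is_complete)
print(best_path, sum(best_path))
```

In a real setting, `propose_steps` would sample candidate next steps from a generator model and `value_fn` would be a learned model trained only on final-answer correctness; the search loop itself would stay the same.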

