Large language models (LLMs) often struggle to maintain accuracy across the sequence of intermediate steps in mathematical reasoning, so errors propagate and undermine the final result. Current approaches to mitigating this issue mainly use a verifier model to assess the correctness of generated solution candidates, scoring either the complete reasoning path or an incomplete one. Rethinking this approach, we argue that assessing the potential of incomplete reasoning paths can be more advantageous because it guides generation towards correct final answers, transforming the task into a \textit{planning} problem. Our proposed verifier, the Outcome-supervision Value Model (OVM), is trained with outcome supervision and offers an efficient and intuitive method for \textit{planning} by prioritizing steps that lead to accurate conclusions over mere per-step correctness. Furthermore, the OVM does not require labor-intensive annotations of step-level correctness, which enhances its scalability. Our experiments on two multi-step mathematical reasoning datasets, GSM8K and Game of 24, demonstrate the superior performance of the OVM. Notably, on GSM8K, our \textbf{OVM-7B model achieves state-of-the-art results among LLMs up to 13B parameters}, without using GPT-4 or code execution. These findings offer a novel perspective on the role of outcome supervision in training verifiers for multi-step reasoning tasks and provide theoretical justification for its advantage in value estimation for planning.
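The planning view described in the abstract, in which a value model scores incomplete reasoning paths and the search keeps the most promising ones, can be illustrated with a small beam search. This is a minimal sketch under stated assumptions, not the authors' implementation: `value_guided_beam_search`, `propose_steps`, `toy_value`, and the toy "reach the target number" task are all hypothetical names introduced here, and the hand-written distance heuristic merely stands in for a learned, outcome-supervised value model.

```python
from typing import Callable, List, Tuple

Path = Tuple[str, ...]  # a (possibly incomplete) sequence of reasoning steps

START, TARGET = 1, 10  # toy task (assumption): reach TARGET starting from START


def apply_steps(path: Path) -> int:
    """Execute a path of toy steps such as '+3' or '*2' starting from START."""
    current = START
    for step in path:
        op, operand = step[0], int(step[1:])
        current = current + operand if op == "+" else current * operand
    return current


def propose_steps(path: Path) -> List[str]:
    """Stand-in for a generator LLM proposing candidate next steps."""
    return ["+1", "*2", "+3"]


def toy_value(path: Path) -> float:
    """Stand-in for a learned value model (assumption, not OVM itself).

    A real value model would estimate how likely the partial path is to end
    in a correct final answer; here closeness to TARGET is used as a proxy.
    """
    return -abs(TARGET - apply_steps(path))


def is_complete(path: Path) -> bool:
    return apply_steps(path) == TARGET


def value_guided_beam_search(
    value_fn: Callable[[Path], float],
    beam_size: int = 2,
    max_steps: int = 5,
) -> Path:
    """Keep the `beam_size` partial paths with the highest estimated value."""
    beam: List[Path] = [()]
    for _ in range(max_steps):
        candidates: List[Path] = []
        for path in beam:
            if is_complete(path):
                candidates.append(path)  # carry finished paths forward unchanged
                continue
            for step in propose_steps(path):
                candidates.append(path + (step,))
        # Rank by the estimated value of each partial path, not by whether
        # each individual step looks correct in isolation.
        candidates.sort(key=value_fn, reverse=True)
        beam = candidates[:beam_size]
        if all(is_complete(p) for p in beam):
            break
    return beam[0]


best = value_guided_beam_search(toy_value)
print(best, "->", apply_steps(best))
```

The design point this sketch tries to mirror is that candidates are ranked by the estimated promise of the whole partial path, so the search is steered by expected outcomes rather than by per-step correctness labels.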