Math-Shepherd: A Label-Free Step-by-Step Verifier for LLMs in Mathematical Reasoning

Large language models (LLMs) have demonstrated remarkable capabilities across a wide range of tasks. However, even the most advanced open-source LLMs, such as the LLaMA family models, still face challenges when it comes to accurately solving complex multi-step mathematical problems. In this paper, we present an innovative process-oriented math verifier called \textbf{Math-Shepherd}, which assigns a reward score to each step of the LLM's outputs on math problems. The training of Math-Shepherd is achieved using automatically constructed process-wise supervision data, breaking the bottleneck of heavy reliance on manual annotation in existing work. With the guidance of Math-Shepherd, a series of open-source LLMs demonstrate exceptional performance. Among them, DeepSeek 67B \citep{DeepSeek-llm} stands out by achieving accuracy rates of 93.3\% on the GSM8K dataset and 48.1\% on the MATH dataset, without external enhancement such as tool usage. Our Math-Shepherd also outperforms the self-consistency method and other existing verification models. We believe that automatic process supervision holds significant potential for the future evolution of LLMs.

翻译：大型语言模型（LLMs）在多种任务中展现出卓越能力，然而即便是最先进的开源LLMs（如LLaMA系列模型），在精准求解复杂多步数学问题时仍面临挑战。本文提出一种创新的过程导向型数学验证器——**Math-Shepherd**，该验证器能为LLMs在数学问题中输出的每个步骤分配奖励分数。通过自动构建的过程级监督数据进行训练，Math-Shepherd突破了现有工作对人工标注的严重依赖瓶颈。在Math-Shepherd的引导下，一系列开源LLMs展现出卓越性能，其中DeepSeek 67B（引用）在无需工具增强等外部手段的情况下，于GSM8K数据集和MATH数据集上分别达到93.3%和48.1%的准确率。此外，Math-Shepherd的性能也优于自一致性方法及其他现有验证模型。我们相信，自动过程监督对LLMs的未来发展具有重要潜力。

相关内容

大语言模型

关注 66

大语言模型是基于海量文本数据训练的深度学习模型。它不仅能够生成自然语言文本，还能够深入理解文本含义，处理各种自然语言任务，如文本摘要、问答、翻译等。2023年，大语言模型及其在人工智能领域的应用已成为全球科技研究的热点，其在规模上的增长尤为引人注目，参数量已从最初的十几亿跃升到如今的一万亿。参数量的提升使得模型能够更加精细地捕捉人类语言微妙之处，更加深入地理解人类语言的复杂性。在过去的一年里，大语言模型在吸纳新知识、分解复杂任务以及图文对齐等多方面都有显著提升。随着技术的不断成熟，它将不断拓展其应用范围，为人类提供更加智能化和个性化的服务，进一步改善人们的生活和生产方式。

【NeurIPS2021】用于文本图表示学习的 GNN 嵌套 Transformer 模型：GraphFormers

专知会员服务

46+阅读 · 2021年11月24日

Linux导论，Introduction to Linux，96页ppt

专知会员服务

82+阅读 · 2020年7月26日

FlowQA: Grasping Flow in History for Conversational Machine Comprehension

专知会员服务

34+阅读 · 2019年10月18日

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

专知会员服务

50+阅读 · 2019年10月17日