In this paper, we present \textbf{Math-Shepherd}, a process reward model for mathematics that assigns a reward score to each step of a math problem solution. Math-Shepherd is trained on automatically constructed process-wise supervision data, which removes the heavy reliance on manual annotation that bottlenecks existing work. We explore the effectiveness of Math-Shepherd in two scenarios: 1) \textit{Verification}: Math-Shepherd reranks multiple outputs generated by Large Language Models (LLMs); 2) \textit{Reinforcement Learning}: Math-Shepherd is used to reinforce LLMs with step-by-step Proximal Policy Optimization (PPO). With Math-Shepherd, a series of open-source LLMs achieve strong performance. For instance, step-by-step PPO with Math-Shepherd significantly improves the accuracy of Mistral-7B (77.9\%$\to$84.1\% on GSM8K and 28.6\%$\to$33.0\% on MATH). Adding Math-Shepherd verification on top further raises accuracy to 89.1\% on GSM8K and 43.5\% on MATH. We believe that automatic process supervision holds significant potential for the future evolution of LLMs.
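To make the verification scenario concrete, the following is a minimal sketch of reranking candidate solutions by per-step reward scores. It is an illustration under stated assumptions, not the implementation described in the paper: the abstract does not say how step scores are combined into a solution score, so the sketch assumes the minimum step score is used, and the names `solution_score` and `best_of_n` are hypothetical. The step scores are supplied by hand here; in practice they would come from the reward model.

```python
from typing import List, Sequence, Tuple

# A candidate is a (solution_text, step_scores) pair, where step_scores holds
# one reward per reasoning step. In this sketch the scores are given directly;
# a real pipeline would obtain them from a process reward model.
Candidate = Tuple[str, Sequence[float]]


def solution_score(step_scores: Sequence[float]) -> float:
    """Collapse per-step rewards into a single solution-level score.

    Assumption (not stated in the abstract): a solution is only as good as
    its weakest step, so the minimum step score is used.
    """
    if not step_scores:
        raise ValueError("a solution needs at least one step score")
    return min(step_scores)


def best_of_n(candidates: List[Candidate]) -> Candidate:
    """Return the candidate whose aggregated step score is highest."""
    if not candidates:
        raise ValueError("no candidates to rerank")
    return max(candidates, key=lambda c: solution_score(c[1]))


if __name__ == "__main__":
    sampled = [
        ("solution A", [0.9, 0.2, 0.95]),   # one weak step drags it down
        ("solution B", [0.7, 0.8, 0.75]),   # uniformly reasonable steps
    ]
    text, scores = best_of_n(sampled)
    print(text, solution_score(scores))
```

With minimum aggregation, a single low-scoring step disqualifies an otherwise confident solution; other aggregations, such as the product of step scores, are equally plausible choices for this sketch.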