SPARE: Single-Pass Annotation with Reference-Guided Evaluation for Automatic Process Supervision and Reward Modelling

Process or step-wise supervision has played a crucial role in advancing the complex multi-step reasoning capabilities of Large Language Models (LLMs). However, efficient, high-quality automated process annotation remains a significant challenge. To address this, we introduce Single-Pass Annotation with Reference-Guided Evaluation (SPARE), a novel structured framework that enables efficient per-step annotation by jointly aligning solution steps to reference solutions and determining their accuracy with explicit reasoning in a single generation. We demonstrate SPARE's effectiveness across four diverse datasets spanning mathematical reasoning (GSM8K, MATH), multi-hop question answering (MuSiQue-Ans), and spatial reasoning (SpaRP), showing consistent improvements in two applications: (1) training Process Reward Models (PRMs) for ranking and aggregating multiple generations, and (2) fine-tuning models via offline reinforcement learning for greedy decoding. On ProcessBench, SPARE demonstrates data-efficient out-of-distribution generalization, using only $\sim$16% of the training samples used by human-labeled and other synthetically trained baselines. Additionally, it achieves performance competitive with MCTS-based methods while offering a 2.3$\times$ speedup in terms of total token count. Manual analysis reveals precision-recall characteristics complementary to those of MCTS approaches, suggesting potential for ensemble methods. These results establish SPARE as a practical and scalable solution for automatic process supervision in LLM reasoning.
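
As a rough illustration of the first application (using PRM scores to rank and aggregate multiple generations), the sketch below assumes a PRM has already assigned each step a correctness score in [0, 1]. The function names, the `min`/`prod` reductions, and the candidate format are hypothetical choices made for this example; they are not taken from SPARE itself.

```python
from collections import defaultdict
from math import prod


def solution_score(step_scores, how="min"):
    """Collapse per-step scores into one solution-level score.

    Assumption: 'min' and 'prod' are illustrative reductions only.
    """
    if not step_scores:
        return 0.0
    if how == "min":
        return min(step_scores)
    if how == "prod":
        return prod(step_scores)
    raise ValueError(f"unknown aggregation: {how}")


def best_of_n(candidates, how="min"):
    """Ranking: return the answer of the single highest-scoring candidate."""
    best = max(candidates, key=lambda c: solution_score(c["step_scores"], how))
    return best["answer"]


def weighted_vote(candidates, how="min"):
    """Aggregation: sum solution scores per distinct answer, pick the largest."""
    totals = defaultdict(float)
    for c in candidates:
        totals[c["answer"]] += solution_score(c["step_scores"], how)
    return max(totals, key=totals.get)


# Hypothetical candidates: three sampled solutions with made-up step scores.
candidates = [
    {"answer": "18", "step_scores": [0.9, 0.8, 0.7]},
    {"answer": "18", "step_scores": [0.6, 0.9, 0.8]},
    {"answer": "20", "step_scores": [0.95, 0.9, 0.85]},
]

print(best_of_n(candidates))      # highest single score wins
print(weighted_vote(candidates))  # answers pool their scores
```

On this toy input the two strategies disagree: ranking picks the single best-scored solution, whereas weighted voting favors the answer supported by two moderately scored solutions.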
