Process reward models (PRMs) offer fine-grained, step-level evaluations that facilitate deeper reasoning processes in large language models (LLMs), proving effective in complex tasks like mathematical reasoning. However, developing PRMs is challenging due to the high cost and limited scalability of human-annotated data. Synthetic data from Monte Carlo (MC) estimation is a promising alternative but suffers from a high noise ratio, which can cause overfitting and hinder large-scale training. In this work, we conduct a preliminary study on the noise distribution in synthetic data from MC estimation, identifying that annotation models tend to both underestimate and overestimate step correctness due to limitations in their annotation capabilities. Building on these insights, we propose Self-Denoising Monte Carlo Annotation (SCAN), an efficient data synthesis and noise-tolerant learning framework. Our key findings are: (1) Even lightweight models (e.g., 1.5B parameters) can produce high-quality annotations through a self-denoising strategy, enabling PRMs to achieve superior performance with only 6% of the inference cost required by vanilla MC estimation. (2) With our robust learning strategy, PRMs can effectively learn from this weak supervision, achieving a 39.2-point F1 improvement (from 19.9 to 59.1) on ProcessBench. Despite using only a compact synthetic dataset, our models surpass strong baselines, including those trained on large-scale human-annotated datasets such as PRM800K. Furthermore, performance continues to improve as we scale up the synthetic data, highlighting the potential of SCAN for scalable, cost-efficient, and robust PRM training.
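To make the annotation setup concrete, the following is a minimal sketch of vanilla MC estimation for step-level labels: for each reasoning prefix, an annotation model samples several completions, and the fraction that reach the correct final answer serves as the step's correctness score. The function names (`mc_step_labels`, `toy_completer`) and the deterministic toy completer are illustrative assumptions, not part of SCAN itself.

```python
def mc_step_labels(steps, sample_completion, n_samples=8, threshold=0.0):
    """Vanilla Monte Carlo estimation of step correctness.

    For each prefix steps[:i+1], draw n_samples completions from the
    annotation model and use the fraction that reach the correct final
    answer as the step's score. A step is labeled correct if its score
    exceeds `threshold` (the common hard-label rule: at least one
    successful completion).
    """
    labels = []
    for i in range(len(steps)):
        prefix = steps[: i + 1]
        successes = sum(sample_completion(prefix) for _ in range(n_samples))
        score = successes / n_samples
        labels.append((score, score > threshold))
    return labels


def toy_completer(prefix):
    """Hypothetical stand-in for the annotation model: pretends the third
    step onward is wrong, so no completion from a prefix containing it
    can recover the correct answer."""
    return len(prefix) < 3


labels = mc_step_labels(["step1", "step2", "bad_step", "step4"], toy_completer)
# First two steps score 1.0 (labeled correct); once the bad step enters
# the prefix, all completions fail and the score drops to 0.0.
```

Note that this hard-label rule is exactly where the noise discussed above enters: a weak completer can fail from a correct prefix (underestimation) or luck into the answer from a flawed one (overestimation), which SCAN's self-denoising strategy is designed to mitigate.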