In scientific reasoning tasks, the veracity of the reasoning process is as critical as the final outcome. While Process Reward Models (PRMs) offer a solution to the coarse-grained supervision problems inherent in Outcome Reward Models (ORMs), their deployment is hindered by the prohibitive cost of obtaining expert-verified step-wise labels. This paper addresses the challenge of training reliable PRMs using abundant but noisy "weak" supervision. We argue that existing Weak-to-Strong Generalization (W2SG) theories lack prescriptive guidelines for selecting high-quality training signals from noisy data. To bridge this gap, we introduce the Dual-Consensus Weak-to-Strong (DC-W2S) framework. By intersecting Self-Consensus (SC) metrics among weak supervisors with Neighborhood-Consensus (NC) metrics in the embedding space, we stratify supervision signals into distinct reliability regimes. We then employ a curriculum of instance-level balanced sampling and label-level reliability-aware masking to guide the training process. We demonstrate that DC-W2S enables the training of robust PRMs for complex reasoning without exhaustive expert annotation, proving that strategic data curation is more effective than indiscriminate training on large-scale noisy datasets.
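The dual-consensus stratification and reliability-aware masking described above can be sketched as follows. This is a minimal illustration under assumed conventions, not the paper's implementation: the binary step-correctness votes, cosine-similarity neighborhoods, thresholds `tau_sc`/`tau_nc`, the three-regime split, and the mask weights are all placeholders chosen for exposition.

```python
import numpy as np

def self_consensus(weak_labels):
    """SC: per-step agreement rate among weak supervisors.

    weak_labels: (n_steps, n_supervisors) array of 0/1 step-correctness votes.
    Returns the fraction of supervisors agreeing with the majority, in [0.5, 1].
    """
    vote_rate = weak_labels.mean(axis=1)
    return np.maximum(vote_rate, 1.0 - vote_rate)

def neighborhood_consensus(embeddings, labels, k=3):
    """NC: fraction of each step's k nearest neighbors (cosine similarity)
    in the embedding space that carry the same weak label."""
    X = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = X @ X.T
    np.fill_diagonal(sims, -np.inf)          # exclude self-matches
    nn = np.argsort(-sims, axis=1)[:, :k]    # indices of the k most similar steps
    return (labels[nn] == labels[:, None]).mean(axis=1)

def stratify(sc, nc, tau_sc=0.75, tau_nc=0.66):
    """Intersect the two consensus signals into three reliability regimes:
    both above threshold -> "high", one of the two -> "medium", neither -> "low"."""
    return np.where((sc >= tau_sc) & (nc >= tau_nc), "high",
           np.where((sc >= tau_sc) | (nc >= tau_nc), "medium", "low"))

def reliability_mask(regimes, weights={"high": 1.0, "medium": 0.5, "low": 0.0}):
    """Label-level reliability-aware mask: down-weight or drop unreliable labels
    when computing the PRM training loss."""
    return np.array([weights[r] for r in regimes])
```

A training loop would multiply the per-step loss by `reliability_mask(...)`, so that labels in the low-reliability regime contribute nothing while high-reliability labels carry full weight; instance-level balanced sampling would then draw training batches evenly across the regimes.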