Process Reward Models (PRMs), which assign fine-grained scores to intermediate reasoning steps within a solution trajectory, have emerged as a promising approach to enhancing the reasoning quality of Large Language Models (LLMs). However, most existing PRMs rely on a unidirectional left-to-right (L2R) evaluation scheme, which restricts their use of global context. To address this limitation, we propose a novel bidirectional evaluation paradigm, the Bidirectional Process Reward Model (BiPRM). BiPRM adds a parallel right-to-left (R2L) evaluation stream, implemented via prompt reversal, alongside the conventional L2R flow. A gating mechanism then adaptively fuses the reward scores from both streams into a holistic quality assessment. Remarkably, compared to the original PRM, BiPRM introduces only a 0.3% parameter increase for the gating module, and the parallel execution of the two streams incurs merely 5% additional inference latency. Our extensive empirical evaluations spanning diverse benchmarks, LLM backbones, PRM objectives, and sampling policies demonstrate that BiPRM consistently surpasses unidirectional baselines, achieving an average relative gain of 10.6% across 54 solution-level configurations and 37.7% across 12 step-level error detection scenarios. Overall, our results highlight the effectiveness, robustness, and broad applicability of BiPRM, offering a promising new direction for process-based reward modeling.
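The gated fusion of the two evaluation streams can be sketched as follows. This is a minimal illustration, not the paper's implementation: the gate's input features and parameterization (here a single scalar logit over the score difference, with hypothetical weights `w` and `b`) are assumptions standing in for the learned gating module.

```python
import math

def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def fuse_rewards(s_l2r: float, s_r2l: float, w: float = 1.0, b: float = 0.0) -> float:
    """Adaptively fuse the L2R and R2L step scores via a learned gate.

    The gate g in (0, 1) produces a convex combination of the two
    stream scores. The logit w * (s_l2r - s_r2l) + b is a hypothetical
    stand-in for whatever features the actual gating module consumes.
    """
    g = sigmoid(w * (s_l2r - s_r2l) + b)
    return g * s_l2r + (1.0 - g) * s_r2l
```

Because the fused score is a convex combination, it always lies between the two stream scores; with a neutral gate (zero logit) it reduces to their simple average.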