While reinforcement learning with verifiable rewards (RLVR) has advanced LLM reasoning in structured domains such as mathematics and programming, its application to general-domain reasoning tasks remains challenging due to the absence of verifiable reward signals. To address this, methods such as Reinforcement Learning with Reference Probability Reward (RLPR) have emerged, using the probability of generating the final answer as a reward signal. However, these outcome-focused approaches neglect step-by-step supervision of the reasoning process itself. To close this gap, we introduce Probabilistic Process Supervision (P2S), a self-supervision framework that provides fine-grained process rewards without requiring a separate reward model or human-annotated reasoning steps. During reinforcement learning, P2S synthesizes and filters a high-quality reference reasoning chain (gold-CoT). The core of our method is a Path Faithfulness Reward (PFR) computed for each reasoning step, derived from the conditional probability of generating the gold-CoT's suffix given the model's current reasoning prefix. Crucially, the PFR can be flexibly combined with any outcome-based reward, directly mitigating reward sparsity by providing dense guidance. Extensive experiments on reading comprehension and medical question-answering benchmarks show that P2S significantly outperforms strong baselines.
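To make the PFR concrete, the sketch below shows one way such a step-level reward could be scored with an off-the-shelf causal language model: the (length-normalized) conditional probability of the gold-CoT suffix given the model's current reasoning prefix. This is a minimal illustration, not the paper's released implementation; the model name, the `compute_pfr` helper, and the geometric-mean normalization are all assumptions for the sake of the example.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative sketch only: model choice and function name are assumptions,
# not the authors' code.
model_name = "Qwen/Qwen2.5-0.5B"  # any causal LM works for this sketch
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

@torch.no_grad()
def compute_pfr(prefix: str, gold_suffix: str) -> float:
    """Score one reasoning step: the conditional probability of the
    gold-CoT suffix given the model's current reasoning prefix,
    length-normalized via the geometric mean over suffix tokens."""
    prefix_ids = tokenizer(prefix, return_tensors="pt").input_ids
    suffix_ids = tokenizer(gold_suffix, add_special_tokens=False,
                           return_tensors="pt").input_ids
    input_ids = torch.cat([prefix_ids, suffix_ids], dim=1)
    logits = model(input_ids).logits
    # Logits at position t predict token t+1, so the suffix tokens are
    # predicted by positions [len(prefix) - 1, len(input) - 2].
    suffix_logits = logits[:, prefix_ids.size(1) - 1 : -1, :]
    log_probs = torch.log_softmax(suffix_logits, dim=-1)
    token_log_probs = log_probs.gather(
        -1, suffix_ids.unsqueeze(-1)).squeeze(-1)
    # Geometric mean keeps rewards comparable across suffix lengths.
    return token_log_probs.mean().exp().item()
```

In an RL loop, a score like this could be computed once per generated reasoning step and mixed with the outcome reward (e.g., a weighted sum), which is what gives the dense, per-step signal described above.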