Reinforcement Learning with Verifiable Rewards (RLVR) has advanced the reasoning capabilities of Large Language Models (LLMs) by relying on direct outcome verification instead of learned reward models. Building on this paradigm, Group Relative Policy Optimization (GRPO) eliminates the need for a critic model, but it assigns credit indiscriminately across intermediate steps, which limits its ability to identify effective reasoning strategies and encourages overthinking. In this work, we introduce a model-free, verifiable form of process supervision that probes the model's belief in the correct answer throughout its reasoning trajectory. By segmenting the generation into discrete steps and tracking the conditional probability of the correct answer when it is appended at each segment boundary, we efficiently compute interpretable segment-wise progress measurements that refine GRPO's trajectory-level feedback. This approach enables more targeted and sample-efficient policy updates while avoiding intermediate supervision derived from costly Monte Carlo rollouts or auxiliary models. Experiments on mathematical and general-domain benchmarks show consistent gains over GRPO across diverse models: accuracy improves by up to 2.6 points and reasoning length falls by up to 13.7% on math tasks, and by up to 2.4 points and 4% respectively on general-domain tasks, demonstrating strong generalization.
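The segment-wise progress signal described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify how segments are delimited, how the answer probability is obtained, or how progress is combined with GRPO's advantage. Here `answer_prob` is a hypothetical callable standing in for a forward pass that scores the correct answer appended to a prefix, and the additive, mean-centered combination with weight `alpha` is an assumption chosen only for illustration.

```python
from typing import Callable, Sequence

import numpy as np


def grpo_advantages(rewards: Sequence[float], eps: float = 1e-8) -> np.ndarray:
    """Group-relative advantages: standardize outcome rewards within one group."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)


def probe_beliefs(
    segments: Sequence[str], answer_prob: Callable[[str], float]
) -> np.ndarray:
    """Belief in the correct answer before any reasoning and after each segment.

    `answer_prob(prefix)` is a hypothetical stand-in for the conditional
    probability of the correct answer when appended after `prefix`.
    """
    prefixes = ["".join(segments[:k]) for k in range(len(segments) + 1)]
    return np.array([answer_prob(p) for p in prefixes], dtype=float)


def segment_progress(beliefs: np.ndarray) -> np.ndarray:
    """Progress of segment k = belief after segment k minus belief before it."""
    return np.diff(beliefs)


def refine_with_progress(
    advantage: float, progress: np.ndarray, alpha: float = 0.5
) -> np.ndarray:
    """Assumed combination rule (not from the source): shift the trajectory-level
    advantage by mean-centered progress, giving one advantage per segment."""
    return advantage + alpha * (progress - progress.mean())


if __name__ == "__main__":
    # Toy belief probe: belief grows with prefix length (purely illustrative).
    toy_prob = lambda prefix: min(1.0, 0.1 * len(prefix))
    beliefs = probe_beliefs(["ab", "c", "defg"], toy_prob)
    progress = segment_progress(beliefs)
    adv = grpo_advantages([1.0, 0.0, 1.0, 0.0])
    print(refine_with_progress(adv[0], progress))
```

Centering the progress values keeps the average of the per-segment advantages equal to the original trajectory-level advantage, so in this sketch the outcome signal is redistributed across segments rather than rescaled.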