Policy optimization for large language models often suffers from sparse reward signals in multi-step reasoning tasks. Critic-free methods such as GRPO assign a single normalized outcome reward to every token, which provides limited guidance for intermediate reasoning steps. Process Reward Models (PRMs) offer dense feedback, but used alone they risk premature collapse, because early low-reward tokens can drive the policy toward truncated outputs. We introduce Process Relative Policy Optimization (PRPO), which combines outcome reliability with process-level guidance in a critic-free framework. PRPO segments reasoning sequences based on semantic clues, normalizes PRM scores into token-level advantages, and aligns their distribution with the outcome advantages through a location-parameter shift. On MATH500, PRPO raises the accuracy of Qwen2.5-Math-1.5B from 61.2% under GRPO to 64.4%, using only eight rollouts and no value network, demonstrating efficient fine-grained credit assignment within critic-free optimization. Code is available at: https://github.com/SchumiDing/srpocode
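The abstract does not give the exact formulas, so the following is only a minimal sketch of the idea it describes, under several assumptions: segment boundaries are already known, PRM scores are standardized within a single rollout, and the location-parameter shift moves the token-level advantages so that their mean equals the rollout's GRPO-style outcome advantage. The function names `outcome_advantages` and `shifted_process_advantages` are hypothetical and are not taken from the released code.

```python
import numpy as np


def outcome_advantages(rewards, eps=1e-8):
    """GRPO-style group normalization: one scalar advantage per rollout."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)


def shifted_process_advantages(segment_scores, segment_lengths, outcome_adv, eps=1e-8):
    """Token-level advantages for one rollout (illustrative assumption).

    segment_scores:  one PRM score per reasoning segment
    segment_lengths: number of tokens in each segment
    outcome_adv:     scalar outcome advantage of this rollout
    """
    scores = np.asarray(segment_scores, dtype=float)
    lengths = np.asarray(segment_lengths, dtype=int)

    # Assumption: standardize the PRM scores within the rollout.
    z = (scores - scores.mean()) / (scores.std() + eps)

    # Every token inherits the normalized score of its segment.
    token_adv = np.repeat(z, lengths)

    # Location shift: recentre so the token-level mean equals the outcome
    # advantage, keeping the relative ordering of segments unchanged.
    return token_adv - token_adv.mean() + outcome_adv


# Eight rollouts with binary outcome rewards, matching the group size in the text.
group_adv = outcome_advantages([1, 0, 0, 1, 0, 0, 0, 1])

# First rollout: three segments of 3, 2, and 4 tokens with made-up PRM scores.
token_adv = shifted_process_advantages([0.9, 0.2, 0.7], [3, 2, 4], group_adv[0])
```

In this sketch the shift changes only where the process advantages are centred, so a rollout with a correct final answer keeps a positive average advantage even if some of its segments score poorly, while the per-segment differences still distinguish stronger steps from weaker ones.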