Reinforcement Learning with Verifiable Rewards (RLVR) has improved the reasoning abilities of Large Language Models (LLMs) by using rule-based binary feedback. However, current RLVR methods typically assign the same reward to every token. This coarse-grained feedback hampers precise credit assignment, making it hard for models to identify which reasoning steps lead to success or failure, and often results in suboptimal policies. Methods like PPO provide credit assignment through value estimation, but yield inaccurate and unverifiable signals due to limited sampling. On the other hand, methods using Process Reward Models (PRMs) can provide step-wise rewards but suffer from several key limitations: they require high-quality process supervision labels, their feedback is unreliable due to probabilistic reward modeling, and their application in online reinforcement learning (RL) is time-consuming. To overcome these limitations, we introduce a simple yet efficient method, Credit Assignment Policy Optimization (CAPO). Instead of training auxiliary models, CAPO directly leverages an off-the-shelf, general-purpose LLM as a Generative Process Reward Model (LLM-as-GenPRM) to generate all step-wise critiques in a single pass, based solely on the correctness of each step, providing deterministic token-level credits that refine the tokens originally assigned identical rule-based rewards. To further enhance accuracy and robustness, we employ voting mechanisms that scale with the number of generated critiques. Extensive experiments on various backbones such as Llama and Qwen show that CAPO consistently outperforms supervised learning-based and RL-based fine-tuning methods across four challenging mathematical benchmarks and three out-of-domain benchmarks. Further analysis shows that CAPO helps the model foster the learning of correct reasoning pathways that lead to correct answers.
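To make the token-level credit refinement and voting idea concrete, the following is a minimal Python sketch, not the paper's actual implementation: it assumes that a GenPRM returns k sampled correct/incorrect verdicts per reasoning step, takes their majority vote, and overrides the uniform rule-based outcome reward with a penalty for tokens belonging to steps judged incorrect. All names (`assign_token_credits`, `majority_vote`, the penalty value) are hypothetical and chosen only for illustration.

```python
# Illustrative sketch of step-wise critique voting and token-level credit assignment.
# Assumption: step_critiques holds k GenPRM verdicts ("correct"/"incorrect") per step;
# these names and the penalty scheme are hypothetical, not CAPO's actual code.
from collections import Counter
from typing import List


def majority_vote(critiques: List[str]) -> str:
    """Return the most common verdict among the sampled critiques."""
    return Counter(critiques).most_common(1)[0][0]


def assign_token_credits(
    step_token_counts: List[int],     # number of tokens in each reasoning step
    step_critiques: List[List[str]],  # k sampled GenPRM verdicts per step
    outcome_reward: float,            # rule-based binary reward for the whole response
    penalty: float = -1.0,            # assumed penalty for tokens in incorrect steps
) -> List[float]:
    """Refine a uniform outcome reward into per-token credits via majority-voted critiques."""
    credits: List[float] = []
    for n_tokens, critiques in zip(step_token_counts, step_critiques):
        verdict = majority_vote(critiques)
        token_reward = outcome_reward if verdict == "correct" else penalty
        credits.extend([token_reward] * n_tokens)
    return credits


# Example: a 3-step solution where step 2 is judged incorrect by 2 of 3 critiques,
# so its tokens receive the penalty instead of the shared outcome reward.
credits = assign_token_credits(
    step_token_counts=[5, 7, 4],
    step_critiques=[["correct"] * 3, ["incorrect", "incorrect", "correct"], ["correct"] * 3],
    outcome_reward=1.0,
)
print(credits)
```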