CAPO: Towards Enhancing LLM Reasoning through Generative Credit Assignment

Reinforcement Learning with Verifiable Rewards (RLVR) has improved the reasoning abilities of Large Language Models (LLMs) by using rule-based binary feedback. However, current RLVR methods typically assign the same reward to every token. This coarse-grained feedback hampers precise credit assignment, making it hard for models to identify which reasoning steps lead to success or failure, and often results in suboptimal policies. Methods like PPO provide credit assignment through value estimation, but yield inaccurate and unverifiable signals due to limited sampling. On the other hand, methods using Process Reward Models (PRMs) can provide step-wise rewards but suffer from several key limitations: they require high-quality process supervision labels, their feedback is unreliable due to probabilistic reward modeling, and applying them in online reinforcement learning (RL) is time-consuming. To overcome these limitations, we introduce a simple but efficient method, Credit Assignment Policy Optimization (CAPO). Instead of training auxiliary models, CAPO directly uses an off-the-shelf, general-purpose LLM as a Generative Process Reward Model (LLM-as-GenPRM) to generate all step-wise critiques in a single pass, judging each step only on its own correctness. This provides deterministic token-level credits that refine the rewards of tokens originally assigned identical rule-based rewards. To further improve accuracy and robustness, we employ voting mechanisms that scale with the number of generated critiques. Extensive experiments on various backbones, such as Llama and Qwen models, show that CAPO consistently outperforms supervised learning-based and RL-based fine-tuning methods across four challenging mathematical benchmarks and three out-of-domain benchmarks. Further analysis shows that CAPO helps the model learn correct reasoning pathways that lead to correct answers.
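
The abstract does not give the exact voting rule or credit formula, so the following is only a minimal sketch of the general idea: several critiques each label every reasoning step as correct or incorrect, a vote aggregates them, and the per-step verdicts are expanded into per-token rewards. The function names, the strict-majority rule, and the fixed `penalty` value are all hypothetical choices made for illustration, not CAPO's actual definitions.

```python
from typing import List


def majority_vote(critiques: List[List[bool]]) -> List[bool]:
    """Aggregate step verdicts across critiques (hypothetical rule).

    critiques[k][i] is True if critique k judged step i correct.
    A step counts as correct only with a strict majority, so ties
    on an even number of critiques are treated as incorrect.
    """
    num_steps = len(critiques[0])
    verdicts = []
    for i in range(num_steps):
        votes_correct = sum(1 for critique in critiques if critique[i])
        verdicts.append(votes_correct * 2 > len(critiques))
    return verdicts


def token_credits(
    step_lengths: List[int],
    step_correct: List[bool],
    outcome_reward: float,
    penalty: float = 0.5,  # assumed value, not taken from the paper
) -> List[float]:
    """Expand per-step verdicts into per-token rewards.

    Every token starts from the shared rule-based outcome reward;
    tokens inside a step judged incorrect have `penalty` subtracted.
    """
    credits: List[float] = []
    for length, is_correct in zip(step_lengths, step_correct):
        reward = outcome_reward if is_correct else outcome_reward - penalty
        credits.extend([reward] * length)
    return credits
```

For instance, with three critiques over three steps where two of them flag the second step, `majority_vote` marks only that step as incorrect, and `token_credits` lowers the reward for just the tokens in that step while the rest keep the outcome reward.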

