Outcome-Grounded Advantage Reshaping for Fine-Grained Credit Assignment in Mathematical Reasoning

Group Relative Policy Optimization (GRPO) has emerged as a promising critic-free reinforcement learning paradigm for reasoning tasks. However, standard GRPO employs a coarse-grained credit assignment mechanism that propagates group-level rewards uniformly to every token in a sequence, neglecting the varying contributions of individual reasoning steps. We address this limitation by introducing Outcome-grounded Advantage Reshaping (OAR), a fine-grained credit assignment mechanism that redistributes advantages based on how much each token influences the model's final answer. We instantiate OAR via two complementary strategies: (1) OAR-P, which estimates outcome sensitivity through counterfactual token perturbations, serving as a high-fidelity attribution signal; (2) OAR-G, which uses an input-gradient sensitivity proxy to approximate the influence signal with a single backward pass. These importance signals are integrated with a conservative Bi-Level advantage reshaping scheme that suppresses low-impact tokens and boosts pivotal ones while preserving the overall advantage mass. Extensive experiments on mathematical reasoning benchmarks show that OAR-P sets the performance upper bound, while OAR-G achieves comparable gains with negligible computational overhead; both significantly outperform a strong GRPO baseline.

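The abstract describes the reshaping scheme only in words, so the following is a minimal sketch of the idea, not the authors' formulation. It assumes a two-level weighting in which tokens whose importance is at or below the sequence mean are scaled down by a fixed factor and the remaining tokens are scaled up so that the weights average to 1, which keeps the summed advantage of the sequence unchanged. The function names, the mean-based threshold, and the `low_scale` parameter are all hypothetical; the importance scores stand in for whatever OAR-P or OAR-G would produce.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style sequence-level advantages: standardize rewards within one group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def reshape_advantage(seq_advantage, importance, low_scale=0.5):
    """Spread one sequence-level advantage over tokens with two weight levels.

    Assumed rule (illustrative only): tokens with importance at or below the
    mean get weight `low_scale`; the rest share the removed mass equally, so
    the weights sum to the number of tokens.
    """
    n = len(importance)
    threshold = mean(importance)
    is_high = [s > threshold for s in importance]
    n_high = sum(is_high)
    n_low = n - n_high
    if n_high == 0 or n_low == 0:
        # No contrast between tokens: fall back to uniform (standard GRPO) credit.
        return [seq_advantage] * n
    high_scale = (n - low_scale * n_low) / n_high
    return [seq_advantage * (high_scale if h else low_scale) for h in is_high]


if __name__ == "__main__":
    # Four sampled answers to one prompt, with binary correctness rewards.
    advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
    # Hypothetical importance scores for a 4-token response; token 2 is pivotal.
    token_adv = reshape_advantage(advantages[0], [0.1, 0.1, 0.9, 0.1])
    print(token_adv, sum(token_adv))
```

With uniform importance scores the function reduces to standard GRPO credit assignment, which is one way to read the "conservative" qualifier in the abstract; the paper itself may define the levels and the normalization differently.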