Recent advances in reasoning language models have been driven by Reinforcement Learning (RL) fine-tuning. Most such approaches rely on the Group Relative Policy Optimization (GRPO) algorithm, or modifications thereof, to steer models toward producing Chain-of-Thought (CoT) traces. Because the final answer can be verified, and the reward assigned, only after the CoT trace is complete, this is a delayed reward problem. GRPO and its modifications correspond to Monte Carlo methods in standard RL, which are known to suffer from high variance. A possible remedy is to redistribute rewards through credit assignment: segments of the CoT trace that are important for arriving at the desired solution are emphasized by assigning them a higher reward. While Monte Carlo (MC) sampling can provide an unbiased estimate of intermediate state values, its computational overhead makes it unsuitable for train-time credit assignment in long contexts at high granularity. We introduce RREDCoT (Reward REDistribution for Chain of Thoughts), which uses the model itself to approximate the optimal reward redistribution without additional generation. We investigate the advantages of our method over MC sampling and several attribution methods. We further analyze several aspects relevant to constructing the redistribution, such as the segmentation of CoT traces and state value estimation.
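To make the contrast between a single delayed reward and a redistributed one concrete, the following is a minimal sketch and not the RREDCoT procedure itself, which the abstract does not specify. It assumes that each CoT segment already has a state value estimate (how those estimates are obtained is exactly what the paper studies) and uses the generic value-difference form of redistribution. The function names, the use of the population standard deviation, and the example numbers are illustrative assumptions.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: standardize each terminal reward within its group.

    Every token of a trace shares the same advantage, so no segment is
    credited more than another. Population std is an assumption here.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def redistribute_reward(segment_values, final_reward):
    """Spread one terminal reward over segments via value differences.

    segment_values[t] is a hypothetical estimate of the expected final
    reward before segment t is generated, so segment_values[0] is the
    value of the prompt alone. Segment t is credited with the change in
    value it causes; the credits sum to final_reward - segment_values[0].
    """
    values = list(segment_values) + [final_reward]
    return [values[t + 1] - values[t] for t in range(len(segment_values))]


# Four sampled traces for one prompt, rewarded only at the end (1 = correct).
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))

# One correct trace with four segments and made-up value estimates.
print(redistribute_reward([0.25, 0.25, 0.75, 0.5], final_reward=1.0))
# prints [0.0, 0.5, -0.25, 0.5]
```

In this toy trace the second and fourth segments receive positive credit, while the third, which lowered the estimated chance of success, receives negative credit. The total credit is unchanged by the redistribution; only its placement along the trace differs.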