Large reasoning models (LRMs) aim to solve diverse and complex problems through structured reasoning. Recent advances in group-based policy optimization methods have shown promise in enabling stable advantage estimation without reliance on process-level annotations. However, these methods depend on advantage gaps induced by high-quality samples within the same batch, which makes training fragile and inefficient when intra-group advantages collapse on challenging tasks. To address these problems, we propose a reinforcement learning mechanism named \emph{\textbf{R$^3$}} that operates along three directions: (1) a \emph{cross-context \underline{\textbf{R}}eplay} strategy that maintains intra-group advantage by recalling valuable examples from historical trajectories of the same query, (2) an \emph{in-context self-\underline{\textbf{R}}eflection} mechanism that enables the model to refine its outputs by leveraging past failures, and (3) a \emph{structural entropy \underline{\textbf{R}}anking reward} that assigns relative rewards to truncated or failed samples by ranking responses according to token-level entropy patterns, capturing both local exploration and global stability. We implement our method on DeepSeek-R1-Distill-Qwen-1.5B and train it on the DeepScaleR-40k dataset in the math domain. Experiments demonstrate that our method achieves state-of-the-art performance on several math benchmarks, delivering significant accuracy gains over the base model while using fewer reasoning tokens. Code and models will be released.
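To make the structural entropy ranking reward concrete, the following is a minimal sketch of how truncated or failed responses could be ranked by their token-level entropy patterns and assigned relative rewards. The scoring function, the `alpha` trade-off between local exploration (mean entropy) and global stability (entropy variance), and the reward range `[r_min, r_max]` are illustrative assumptions, not the paper's actual formulation.

```python
import math

def token_entropy(probs):
    # Shannon entropy of a single token's output distribution
    return -sum(p * math.log(p) for p in probs if p > 0)

def structural_entropy_score(entropies, alpha=0.5):
    # Illustrative score: reward local exploration (mean entropy)
    # while penalizing instability (entropy variance). The exact
    # combination used in the paper may differ.
    n = len(entropies)
    mean_h = sum(entropies) / n
    var_h = sum((h - mean_h) ** 2 for h in entropies) / n
    return alpha * mean_h - (1 - alpha) * var_h

def ranking_rewards(entropy_seqs, r_min=-0.5, r_max=0.0):
    # Rank failed/truncated responses by their structural entropy score
    # and map ranks linearly into a small negative-to-zero reward band,
    # so even unsuccessful samples contribute an advantage gap.
    scores = [structural_entropy_score(e) for e in entropy_seqs]
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    rewards = [0.0] * len(scores)
    for rank, idx in enumerate(order):
        frac = rank / max(len(scores) - 1, 1)
        rewards[idx] = r_min + frac * (r_max - r_min)
    return rewards
```

In this sketch, a response with steadier, more exploratory token entropies ranks higher and receives a reward closer to `r_max`, restoring a usable intra-group advantage even when no response in the batch is correct.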