Large reasoning models (LRMs) aim to solve diverse and complex problems through structured reasoning. Recent advances in group-based policy optimization methods have shown promise in enabling stable advantage estimation without reliance on process-level annotations. However, these methods depend on advantage gaps induced by high-quality samples within the same batch, which makes training fragile and inefficient when intra-group advantages collapse on challenging tasks. To address these problems, we propose a reinforcement learning mechanism named \emph{\textbf{R$^3$}} that operates along three directions: (1) a \emph{cross-context \underline{\textbf{R}}eplay} strategy that maintains intra-group advantage by recalling valuable examples from historical trajectories of the same query, (2) an \emph{in-context self-\underline{\textbf{R}}eflection} mechanism that enables the model to refine its outputs by leveraging past failures, and (3) a \emph{structural entropy \underline{\textbf{R}}anking reward} that assigns relative rewards to truncated or failed samples by ranking responses according to token-level entropy patterns, capturing both local exploration and global stability. We implement our method on DeepSeek-R1-Distill-Qwen-1.5B and train it on the DeepScaleR-40k dataset in the math domain. Experiments demonstrate that our method achieves state-of-the-art performance on several math benchmarks, delivering significant accuracy gains over the base model while using fewer reasoning tokens. Code and models will be released.
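To make the structural entropy ranking reward concrete, the following is a minimal sketch of how truncated or failed responses could be ranked by their token-level entropy patterns and assigned relative rewards. The scoring function, the `alpha` trade-off between local exploration (mean entropy) and global stability (entropy variance), and the reward range `[r_min, r_max]` are illustrative assumptions, not the paper's actual formulation.

```python
import math

def token_entropy(probs):
    # Shannon entropy of a single token's output distribution
    return -sum(p * math.log(p) for p in probs if p > 0)

def structural_entropy_score(entropies, alpha=0.5):
    # Illustrative score: reward local exploration (mean entropy)
    # while penalizing instability (entropy variance). The exact
    # combination used in the paper may differ.
    n = len(entropies)
    mean_h = sum(entropies) / n
    var_h = sum((h - mean_h) ** 2 for h in entropies) / n
    return alpha * mean_h - (1 - alpha) * var_h

def ranking_rewards(entropy_seqs, r_min=-0.5, r_max=0.0):
    # Rank failed/truncated responses by their structural entropy score
    # and map ranks linearly into a small negative-to-zero reward band,
    # so even unsuccessful samples contribute an advantage gap.
    scores = [structural_entropy_score(e) for e in entropy_seqs]
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    rewards = [0.0] * len(scores)
    for rank, idx in enumerate(order):
        frac = rank / max(len(scores) - 1, 1)
        rewards[idx] = r_min + frac * (r_max - r_min)
    return rewards
```

In this sketch, a response with steadier, more exploratory token entropies ranks higher and receives a reward closer to `r_max`, restoring a usable intra-group advantage even when no response in the batch is correct.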