Reinforcement learning with verifiable rewards (RLVR) has improved the reasoning ability of large language models, yet training remains costly: given the computation each rollout requires, many rollouts contribute little to optimization. This study investigates how leveraging intrinsic data properties, which are available at almost no extra cost during training, can improve the data efficiency of RLVR. We propose PREPO, which consists of two complementary components. First, we adopt prompt perplexity as an indicator of model adaptability in learning, enabling the model to progress from well-understood contexts to more challenging ones. Second, we amplify the discrepancy among rollouts by differentiating their relative entropy, and prioritize sequences that exhibit a higher degree of exploration. Together, these mechanisms reduce rollout demand while preserving competitive performance. On Qwen and Llama models, PREPO achieves competitive results on mathematical reasoning benchmarks with up to 3 times fewer rollouts than the baselines. Beyond these empirical gains, we provide theoretical and in-depth analyses explaining why our method improves the data efficiency of RLVR.
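To make the two components concrete, the following is a minimal sketch of perplexity-ordered prompts and entropy-prioritized rollouts; it is not the paper's implementation, and the function and field names (prompt_perplexity, rollout_entropy, token_logprobs, step_entropies, keep_per_prompt) are illustrative assumptions.

```python
import math
from typing import Dict, List, Tuple

def prompt_perplexity(token_logprobs: List[float]) -> float:
    """Perplexity of a prompt under the current policy: exp of the mean negative log-prob."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def rollout_entropy(step_entropies: List[float]) -> float:
    """Mean per-token entropy of a rollout, used here as a proxy for exploration."""
    return sum(step_entropies) / len(step_entropies)

def select_training_data(
    prompts: List[Dict],
    rollouts_per_prompt: Dict[str, List[Dict]],
    keep_per_prompt: int,
) -> Tuple[List[str], Dict[str, List[Dict]]]:
    """
    prompts: dicts with 'id' and 'token_logprobs' (prompt tokens scored by the policy).
    rollouts_per_prompt: prompt id -> dicts with 'step_entropies' for each sampled rollout.
    Returns prompt ids ordered easy-to-hard and, per prompt, the most exploratory rollouts.
    """
    # 1) Curriculum over prompts: low perplexity (well-understood contexts) first.
    ordered = sorted(prompts, key=lambda p: prompt_perplexity(p["token_logprobs"]))
    selection: Dict[str, List[Dict]] = {}
    for p in ordered:
        rollouts = rollouts_per_prompt[p["id"]]
        # 2) Prioritize rollouts with higher relative entropy (more exploration),
        #    keeping only a small budget per prompt to cut rollout demand.
        ranked = sorted(rollouts, key=lambda r: rollout_entropy(r["step_entropies"]), reverse=True)
        selection[p["id"]] = ranked[:keep_per_prompt]
    return [p["id"] for p in ordered], selection
```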