Recent advances in reinforcement learning with verifiable rewards (RLVR) show that large language models improve their reasoning abilities when trained on verifiable signals. However, because rewards are sparse, RLVR's effectiveness depends heavily on selecting samples of appropriate difficulty. In this work, we present a formal analysis of online difficulty-aware filtering and establish its theoretical foundations. We show that the expected policy improvement is lower-bounded by the variance of task-level success probabilities, implying that selecting tasks of intermediate difficulty maximizes learning efficiency. Building on this result, we demonstrate that balanced filtering maximizes this lower bound, yielding superior performance and sample efficiency. Evaluations across multiple math reasoning benchmarks confirm that balanced filtering consistently improves convergence speed and final performance, achieving gains of up to +12% in fewer than half the training steps of standard GRPO. By extending our analysis to various reward distributions, we provide a principled foundation for future RLVR curriculum strategies, supported by both theoretical analysis and extensive empirical results.
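As a one-line illustration of why intermediate difficulty maximizes the variance term in the bound (a toy setting of our own, not a result lifted verbatim from the analysis): with a binary verifiable reward, a task solved with probability $p$ yields per-rollout reward variance $p(1-p)$, which peaks at $p = 1/2$:

```latex
% Bernoulli reward r ~ Bern(p) for a task with success probability p:
\[
  \operatorname{Var}[r] \;=\; \mathbb{E}[r^2] - \mathbb{E}[r]^2
  \;=\; p - p^2 \;=\; p(1-p),
  \qquad
  \frac{d}{dp}\, p(1-p) \;=\; 1 - 2p \;=\; 0
  \;\Longrightarrow\; p^\star = \tfrac{1}{2}.
\]
```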
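The sketch below shows one way balanced online filtering could sit inside a GRPO-style training loop: estimate each task's success rate from a few verifiable rollouts, drop tasks the policy always or never solves, and draw an equal quota from each difficulty bin. Everything here is an assumption for illustration; the hook names `rollout` and `verify`, the bin count, and the per-bin quota are ours, not the paper's.

```python
import random
from collections import defaultdict

def balanced_filter(policy, tasks, rollout, verify,
                    n_rollouts=8, n_bins=4, per_bin=16):
    """Select a batch of intermediate-difficulty tasks, balanced across bins.

    `rollout(policy, task)` samples one completion and `verify(task, out)`
    returns True/False; both are placeholders for the trainer's own hooks.
    """
    bins = defaultdict(list)
    for task in tasks:
        # Monte-Carlo estimate of the task-level success probability p.
        wins = sum(verify(task, rollout(policy, task)) for _ in range(n_rollouts))
        p_hat = wins / n_rollouts
        # p_hat in {0, 1} means every rollout got the same reward, so the
        # group-normalized (GRPO-style) advantage is zero: no gradient signal.
        if 0.0 < p_hat < 1.0:
            bins[int(p_hat * n_bins)].append(task)
    batch = []
    for bucket in bins.values():  # equal quota per difficulty bin
        random.shuffle(bucket)
        batch.extend(bucket[:per_bin])
    random.shuffle(batch)
    return batch
```

Bucketing by estimated success rate and sampling evenly from each bin is one way to keep the batch's success probabilities spread around 1/2, where the variance term that lower-bounds policy improvement is largest.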