Standard Reinforcement Learning with Verifiable Rewards (RLVR) training allocates a fixed rollout budget to every query, regardless of how difficult that query is for the current policy. This leads to two symmetric failure modes: easy queries yield near-zero advantage because the policy already solves them, and unsolvable queries yield no signal because the policy never solves them. In both regimes, training FLOPs are spent without contributing to the learning gradient. We introduce sorted Group Policy Optimization (sGPO), a compute-efficient strategy that trades a small budget of inference FLOPs for a large reduction in wasted training FLOPs. The key insight is that cheap inference compute can serve as a single offline proxy for query difficulty: by generating a small batch of parallel samples per query under the initial policy, we obtain a model-aware empirical success rate. This success rate motivates setting the training rollout group size to its inverse, a practical rule that maximizes sample efficiency by extracting the most advantage per generated rollout. The same profiling pass simultaneously drives data filtering (removing trivial queries and sub-sampling unsolvable ones), adaptive group size allocation, and curriculum construction (scheduling queries from easy to hard). sGPO matches or exceeds baseline performance while reducing total training compute by a factor of three, even with the upfront inference profiling cost included.
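As a rough illustration of how a single profiling pass could drive filtering, group sizing, and curriculum ordering together, the following minimal sketch turns per-query profiling rewards into a training plan. It is not the authors' implementation: the function name `plan_rollouts`, the `QueryPlan` record, the `min_group` and `max_group` bounds, and the deterministic "keep every k-th unsolved query" sub-sampling rule are all assumptions made for the example. Only the inverse-success-rate rule, the removal of trivial queries, the sub-sampling of unsolvable ones, and the easy-to-hard ordering come from the text.

```python
from dataclasses import dataclass


@dataclass
class QueryPlan:
    query_id: str
    success_rate: float  # empirical success rate from the profiling pass
    group_size: int      # number of training rollouts allocated to the query


def plan_rollouts(profile, min_group=2, max_group=64, keep_unsolved_every=4):
    """Build a training plan from profiling results.

    profile: dict mapping query_id -> non-empty list of 0/1 rewards obtained
             by sampling the initial policy.
    min_group, max_group, keep_unsolved_every: hypothetical knobs chosen for
             this sketch, not values taken from the paper.
    """
    plans = []
    unsolved_seen = 0
    for query_id, rewards in profile.items():
        n, k = len(rewards), sum(rewards)
        if k == n:
            # Trivial query: the policy already solves it, so drop it.
            continue
        if k == 0:
            # Unsolved during profiling: keep only every k-th such query
            # (an assumed sub-sampling rule) and give it the largest group.
            unsolved_seen += 1
            if unsolved_seen % keep_unsolved_every != 0:
                continue
            group = max_group
        else:
            # Inverse success rate, ceil(n / k), computed with integers
            # and clamped to the assumed bounds.
            group = min(max_group, max(min_group, -(-n // k)))
        plans.append(QueryPlan(query_id, k / n, group))
    # Curriculum: highest success rate (easiest) first.
    plans.sort(key=lambda plan: plan.success_rate, reverse=True)
    return plans


if __name__ == "__main__":
    demo = {"a": [1, 1, 1, 1], "b": [1, 0, 0, 0], "c": [1, 1, 0, 0], "d": [0, 0, 0, 0]}
    for plan in plan_rollouts(demo, keep_unsolved_every=1):
        print(plan)
```

In this sketch a query solved in one of four profiling samples receives four training rollouts, so that on average about one rollout per group succeeds. The clamp to `max_group` is needed because the inverse is undefined for a zero success rate.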