Reinforcement learning applied to large language models (LLMs) for reasoning tasks is often bottlenecked by unstable gradient estimates due to fixed and uniform sampling of responses across prompts. Prior work such as GVM-RAFT addresses this by dynamically allocating inference budget per prompt to minimize stochastic gradient variance under a budget constraint. Inspired by this insight, we propose Reinforce-Ada, an adaptive sampling framework for online RL post-training of LLMs that continuously reallocates sampling effort to the prompts with the greatest uncertainty or learning potential. Unlike conventional two-stage allocation methods, Reinforce-Ada interleaves estimation and sampling in an online successive elimination process, and automatically stops sampling for a prompt once sufficient signal is collected. To stabilize updates, we form fixed-size groups with enforced reward diversity and compute advantage baselines using global statistics aggregated over the adaptive sampling phase. Empirical results across multiple model architectures and reasoning benchmarks show that Reinforce-Ada accelerates convergence and improves final performance compared to GRPO, especially when using the balanced sampling variant. Our work highlights the central role of variance-aware, adaptive data curation in enabling efficient and reliable reinforcement learning for reasoning-capable LLMs. Code is available at https://github.com/RLHFlow/Reinforce-Ada.
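The abstract describes the adaptive sampling loop only at a high level; below is a minimal Python sketch of that loop under stated assumptions. The function name `reinforce_ada_sample`, the callables `generate` and `reward_fn`, the parameters `group_size`, `rounds`, and `samples_per_round`, and the specific stopping and group-selection rules are illustrative placeholders, not the exact procedure from the paper.

```python
import numpy as np

def reinforce_ada_sample(prompts, generate, reward_fn,
                         group_size=8, rounds=4, samples_per_round=4):
    """Sketch of an adaptive sampling loop in the spirit of the abstract:
    sample in rounds, stop sampling a prompt once it shows enough reward
    diversity, then build fixed-size groups whose advantages are normalized
    with statistics aggregated over every sample drawn for that prompt."""
    pool = {p: {"responses": [], "rewards": []} for p in prompts}
    active = set(prompts)

    # Online successive elimination: prompts leave the active set as soon
    # as they have collected "sufficient signal" (assumed rule below).
    for _ in range(rounds):
        if not active:
            break
        for p in list(active):
            for _ in range(samples_per_round):
                resp = generate(p)                       # draw one response
                pool[p]["responses"].append(resp)
                pool[p]["rewards"].append(reward_fn(p, resp))
            r = pool[p]["rewards"]
            # Assumed stopping rule: both reward outcomes observed and at
            # least one group's worth of samples collected.
            if len(set(r)) > 1 and len(r) >= group_size:
                active.discard(p)

    batches = []
    for p, data in pool.items():
        rewards = np.asarray(data["rewards"], dtype=float)
        if len(set(data["rewards"])) < 2:
            continue  # all-correct or all-wrong prompt: no gradient signal
        # Global statistics over the whole adaptive-sampling phase for this
        # prompt serve as the advantage baseline.
        mean, std = rewards.mean(), rewards.std() + 1e-6
        # Enforce reward diversity in the fixed-size group by mixing
        # above- and below-baseline responses (illustrative heuristic).
        above = [i for i, x in enumerate(rewards) if x > mean]
        below = [i for i, x in enumerate(rewards) if x <= mean]
        keep = (above[: group_size // 2] + below[: group_size // 2])[:group_size]
        batches.append({
            "prompt": p,
            "responses": [data["responses"][i] for i in keep],
            "advantages": [(rewards[i] - mean) / std for i in keep],
        })
    return batches
```

In a full training loop, the returned groups would feed a GRPO-style policy-gradient update; the point the sketch illustrates is that the advantage baseline uses statistics over all samples drawn during the adaptive phase, not only the subset kept in the final group.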