Reinforcement learning with verifiable rewards (RLVR) has emerged as the leading approach for enhancing reasoning capabilities in large language models. However, it faces a fundamental compute and memory asymmetry: rollout generation is embarrassingly parallel and memory-light, whereas policy updates are communication-heavy and memory-intensive. To address this, we introduce PODS (Policy Optimization with Down-Sampling), which decouples rollout generation from policy updates by training only on a strategically selected subset of rollouts, maintaining learning quality while dramatically reducing update costs. We propose a principled subset selection criterion, max-variance down-sampling, that maximizes reward diversity, and provide an efficient $O(n\log n)$ implementation. Empirically, Group Relative Policy Optimization (GRPO) with PODS achieves the peak test accuracy of vanilla GRPO at least $\mathbf{1.7\times}$ faster across the different reasoning benchmarks and hardware configurations we tested.
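To make the selection step concrete, the following is a minimal Python sketch of max-variance down-sampling, not the authors' code. It relies on the structural assumption suggested by the $O(n\log n)$ claim: a variance-maximizing subset of size $m$ can be taken as the $k$ highest-reward and $m-k$ lowest-reward rollouts for some $k$, so sorting dominates the cost and each split is scored in constant time with prefix sums. The function name and NumPy-based implementation are illustrative.

import numpy as np

def max_variance_downsample(rewards, m):
    """Pick m of n rollout rewards so the selected subset has maximal variance.

    Assumption (hedged): an optimal subset consists of the k largest and
    m - k smallest rewards for some k. The sort is O(n log n); each of the
    m + 1 candidate splits is evaluated in O(1) via prefix sums.
    Returns the indices of the selected rollouts.
    """
    rewards = np.asarray(rewards, dtype=float)
    n = rewards.size
    assert 0 < m <= n
    order = np.argsort(rewards)                                   # ascending reward order
    sorted_r = rewards[order]
    prefix = np.concatenate(([0.0], np.cumsum(sorted_r)))         # sums of the i smallest rewards
    prefix_sq = np.concatenate(([0.0], np.cumsum(sorted_r ** 2))) # sums of their squares

    best_var, best_k = -1.0, 0
    for k in range(m + 1):                                        # k largest + (m - k) smallest
        low = m - k
        s1 = prefix[low] + (prefix[n] - prefix[n - k])
        s2 = prefix_sq[low] + (prefix_sq[n] - prefix_sq[n - k])
        var = s2 / m - (s1 / m) ** 2                              # population variance of candidate subset
        if var > best_var:
            best_var, best_k = var, k

    low = m - best_k
    return np.concatenate((order[:low], order[n - best_k:]))

# Example (hypothetical verifier scores for n = 8 rollouts): keep the 4 most
# reward-diverse rollouts for the policy update and discard the rest.
rewards = [1.0, 0.0, 0.0, 1.0, 0.0, 0.5, 1.0, 0.0]
keep = max_variance_downsample(rewards, m=4)

In this sketch the down-sampled indices would then be the only rollouts fed to the GRPO update, which is where the claimed reduction in update cost comes from.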