Reinforcement learning with verifiable rewards (RLVR) has emerged as the leading approach for enhancing reasoning capabilities in large language models. However, it faces a fundamental compute and memory asymmetry: rollout generation is embarrassingly parallel and memory-light, whereas policy updates are communication-heavy and memory-intensive. To address this, we introduce PODS (Policy Optimization with Down-Sampling), which decouples rollout generation from policy updates by training only on a strategically selected subset of rollouts, maintaining learning quality while dramatically reducing update costs. We propose a principled subset selection criterion, max-variance down-sampling, that maximizes reward diversity, and provide an efficient $O(n\log n)$ implementation. Empirically, Group Relative Policy Optimization (GRPO) with PODS achieves the peak test accuracy of vanilla GRPO at least $\mathbf{1.7\times}$ faster across the different reasoning benchmarks and hardware configurations we tested.
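To make the selection step concrete, the following is a minimal Python sketch of max-variance down-sampling, not the authors' code. It relies on the structural assumption suggested by the $O(n\log n)$ claim: a variance-maximizing subset of size $m$ can be taken as the $k$ highest-reward and $m-k$ lowest-reward rollouts for some $k$, so sorting dominates the cost and each split is scored in constant time with prefix sums. The function name and NumPy-based implementation are illustrative.

import numpy as np

def max_variance_downsample(rewards, m):
    """Pick m of n rollout rewards so the selected subset has maximal variance.

    Assumption (hedged): an optimal subset consists of the k largest and
    m - k smallest rewards for some k. The sort is O(n log n); each of the
    m + 1 candidate splits is evaluated in O(1) via prefix sums.
    Returns the indices of the selected rollouts.
    """
    rewards = np.asarray(rewards, dtype=float)
    n = rewards.size
    assert 0 < m <= n
    order = np.argsort(rewards)                                   # ascending reward order
    sorted_r = rewards[order]
    prefix = np.concatenate(([0.0], np.cumsum(sorted_r)))         # sums of the i smallest rewards
    prefix_sq = np.concatenate(([0.0], np.cumsum(sorted_r ** 2))) # sums of their squares

    best_var, best_k = -1.0, 0
    for k in range(m + 1):                                        # k largest + (m - k) smallest
        low = m - k
        s1 = prefix[low] + (prefix[n] - prefix[n - k])
        s2 = prefix_sq[low] + (prefix_sq[n] - prefix_sq[n - k])
        var = s2 / m - (s1 / m) ** 2                              # population variance of candidate subset
        if var > best_var:
            best_var, best_k = var, k

    low = m - best_k
    return np.concatenate((order[:low], order[n - best_k:]))

# Example (hypothetical verifier scores for n = 8 rollouts): keep the 4 most
# reward-diverse rollouts for the policy update and discard the rest.
rewards = [1.0, 0.0, 0.0, 1.0, 0.0, 0.5, 1.0, 0.0]
keep = max_variance_downsample(rewards, m=4)

In this sketch the down-sampled indices would then be the only rollouts fed to the GRPO update, which is where the claimed reduction in update cost comes from.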