Reinforcement Learning (RL) has become essential for eliciting complex reasoning capabilities in Large Language Models (LLMs). However, the substantial memory overhead of storing Key-Value (KV) caches during long-horizon rollouts is a critical bottleneck, often prohibiting efficient training on limited hardware. While existing KV compression techniques offer a remedy for inference, directly applying them to RL training induces a severe policy mismatch, leading to catastrophic performance collapse. To address this, we introduce Sparse-RL, a framework that enables stable RL training under sparse rollouts. We show that the instability arises from a fundamental policy mismatch among the dense old policy, the sparse sampler policy, and the learner policy. To mitigate this issue, Sparse-RL incorporates Sparsity-Aware Rejection Sampling and Importance-based Reweighting to correct the off-policy bias introduced by compression-induced information loss. Experimental results show that Sparse-RL reduces rollout overhead compared to dense baselines while preserving performance. Furthermore, Sparse-RL inherently implements sparsity-aware training, significantly enhancing model robustness during sparse inference deployment.
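To make the off-policy correction concrete, the following is a minimal sketch of how rejection and importance reweighting could be combined for a single rollout. It is a generic illustration and not the Sparse-RL formulation itself: the function name `sparse_rollout_weights`, the per-token rejection rule, and the thresholds `min_ratio` and `max_weight` are all assumptions introduced here. The inputs are the log-probabilities of the sampled tokens under the dense old policy and under the sparse sampler policy.

```python
import math
from typing import List, Optional, Sequence


def sparse_rollout_weights(
    dense_logps: Sequence[float],
    sparse_logps: Sequence[float],
    min_ratio: float = 1e-3,   # hypothetical rejection threshold
    max_weight: float = 2.0,   # hypothetical clipping bound
) -> Optional[List[float]]:
    """Return per-token importance weights, or None if the rollout is rejected.

    Illustrative rule (an assumption, not taken from the abstract):
    - ratio_t = pi_dense(token_t) / pi_sparse(token_t)
    - reject the whole rollout if any ratio falls below `min_ratio`,
      i.e. the sparse sampler produced a token the dense policy finds
      very unlikely
    - otherwise reweight each token by the ratio, clipped at `max_weight`
    """
    if len(dense_logps) != len(sparse_logps):
        raise ValueError("log-prob sequences must have equal length")

    # Importance ratio computed in log space, then exponentiated.
    ratios = [math.exp(d - s) for d, s in zip(dense_logps, sparse_logps)]

    # Rejection step: discard rollouts that drift too far from the dense policy.
    if any(r < min_ratio for r in ratios):
        return None

    # Reweighting step: clipped importance weights for the kept rollout.
    return [min(r, max_weight) for r in ratios]


if __name__ == "__main__":
    # Sampler agrees with the dense policy: every weight is 1.
    print(sparse_rollout_weights([-0.1, -0.2], [-0.1, -0.2]))
    # First token is far less likely under the dense policy: rejected.
    print(sparse_rollout_weights([-9.0, -0.2], [-0.5, -0.2]))
    # First token is more likely under the dense policy: weight is clipped.
    print(sparse_rollout_weights([-0.1, -0.2], [-2.0, -0.2]))
```

In a training loop, rejected rollouts would be dropped from the batch and the returned weights would scale the per-token policy-gradient loss. Both thresholds are placeholders that would need tuning for any real setup.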