Reinforcement learning with verifiable rewards (RLVR) has become a trending paradigm for training reasoning large language models (LLMs). However, due to the autoregressive decoding nature of LLMs, the rollout process becomes the efficiency bottleneck of RL training, accounting for up to 70\% of the total training time. In this work, we propose Quantized Reinforcement Learning (QuRL), which uses a quantized actor to accelerate rollouts. We address two challenges in QuRL. First, we propose Adaptive Clipping Range (ACR), which dynamically adjusts the clipping ratio based on the policy ratio between the full-precision actor and the quantized actor and is essential for mitigating long-term training collapse. Second, we identify the weight update problem: weight changes between RL steps are extremely small, making them difficult for the quantization operation to capture. We mitigate this problem with an invariant scaling technique that reduces quantization noise and amplifies the effective weight update. We evaluate our method with INT8 and FP8 quantization experiments on DeepScaleR and DAPO, achieving 20\% to 80\% faster rollouts during training.
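The two phenomena above can be illustrated with a minimal sketch. The first function shows one hypothetical form of a policy-ratio-dependent clip range (the paper's exact ACR rule may differ; `adaptive_clip_range` and its shrinking rule are assumptions for illustration only). The second part demonstrates the weight update problem: under standard per-tensor symmetric INT8 quantization, a weight update far smaller than one quantization step leaves the quantized weights unchanged.

```python
import numpy as np

# --- Sketch 1: policy-ratio-aware clip range (hypothetical form of ACR) ---
def adaptive_clip_range(base_eps, logp_fp, logp_q):
    # Shrink the PPO-style clip range as the quantized actor's probability
    # diverges from the full-precision actor's. This illustrates only the
    # dependence on the policy ratio, not the paper's actual rule.
    divergence = np.exp(np.abs(logp_fp - logp_q))  # >= 1
    return base_eps / divergence

# --- Sketch 2: the weight update problem under INT8 quantization ---
def quantize_int8(w):
    # Standard per-tensor symmetric INT8 quantization.
    scale = np.max(np.abs(w)) / 127.0
    return np.round(w / scale).astype(np.int8), scale

w = np.array([0.5, -0.3, 0.8])
q0, _ = quantize_int8(w)
q1, _ = quantize_int8(w + 1e-5)  # tiny RL-step update, below one quant step
print(np.array_equal(q0, q1))    # True: the update vanishes after quantization
```

When the two actors agree, the sketch leaves the clip range untouched; any drift tightens it, which matches the abstract's motivation of keeping the off-policy gap between the quantized rollout actor and the full-precision learner under control.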