Reinforcement learning (RL), particularly RL from verifiable reward (RLVR), has become a crucial phase of training large language models (LLMs) and a key focus of current scaling efforts. However, optimization practices in RL largely follow those of the next-token prediction stages (e.g., pretraining and supervised fine-tuning (SFT)), despite fundamental differences between RL and these stages that recent work has highlighted. One such practice is the use of the AdamW optimizer, which is widely adopted for training large-scale transformers despite its high memory overhead. Our analysis shows that both the momentum and the adaptive learning rates in AdamW are less influential in RL than in SFT, leading us to hypothesize that RL benefits less from Adam-style per-parameter adaptive learning rates and momentum. Confirming this hypothesis, our experiments demonstrate that the substantially more memory-efficient SGD, which is known to perform poorly in supervised learning of large-scale transformers, matches or even outperforms AdamW in RL for LLMs. Remarkably, full fine-tuning with SGD updates fewer than 0.02% of model parameters without any sparsity-promoting regularization, more than 1000 times fewer than AdamW. Our analysis offers potential reasons for this update sparsity. These findings provide new insights into the optimization dynamics of RL in LLMs and show that RL can be substantially more parameter-efficient than previously recognized.
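The abstract reports that SGD effectively updates fewer than 0.02% of parameters. As a hedged illustration only, and not the paper's own analysis, the sketch below shows one plausible mechanism for such sparsity when weights are stored in low precision: a raw SGD update smaller than a weight's rounding granularity vanishes, whereas Adam-style per-parameter normalization rescales even tiny gradients to roughly the learning-rate magnitude, so far more entries actually change. Here float16 stands in for bf16, the heavy-tailed gradient distribution is synthetic, and the "Adam" step is a simplified single step without momentum.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Weights stored in half precision (float16 as a stand-in for bf16).
w = rng.standard_normal(n).astype(np.float16)

# Synthetic heavy-tailed gradient: magnitudes span eight orders of magnitude.
sign = rng.choice([-1.0, 1.0], size=n)
g = (sign * 10.0 ** rng.uniform(-8, 0, size=n)).astype(np.float32)

lr = 1e-3

# Plain SGD step: updates far below a weight's rounding granularity
# disappear when the result is cast back to half precision.
w_sgd = (w.astype(np.float32) - lr * g).astype(np.float16)

# Simplified single Adam-style step (no momentum, v = g^2): the update
# lr * g / (sqrt(v) + eps) has magnitude ~lr for every nonzero gradient.
update_adam = lr * g / (np.sqrt(g * g) + 1e-8)
w_adam = (w.astype(np.float32) - update_adam).astype(np.float16)

# Fraction of parameters whose stored value actually changed.
frac_sgd = float(np.mean(w_sgd != w))
frac_adam = float(np.mean(w_adam != w))
print(f"fraction changed, SGD : {frac_sgd:.4f}")
print(f"fraction changed, Adam: {frac_adam:.4f}")
```

Under these assumptions the SGD step leaves most weights bit-identical, while the normalized step changes nearly all of them; the numbers depend entirely on the synthetic gradient scale and precision chosen here.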