Training deep reinforcement learning (DRL) models is usually computationally expensive, so compressing them has considerable potential for accelerating training and easing deployment. However, existing methods for obtaining small models mainly rely on knowledge distillation, which requires iteratively training a dense network, so the training process still demands massive computing resources. Sparse training from scratch in DRL remains underexplored and is particularly challenging because bootstrapped training is non-stationary. In this work, we propose a novel sparse DRL training framework, "the Rigged Reinforcement Learning Lottery" (RLx2), which builds on gradient-based topology evolution and trains a sparse DRL model using only a sparse network throughout. Specifically, RLx2 introduces a multi-step TD target mechanism with a dynamic-capacity replay buffer to achieve robust value learning and efficient topology exploration in sparse models. RLx2 reaches state-of-the-art sparse training performance on several tasks, achieving 7.5× to 20× model compression with less than 3% performance degradation, and up to 20× and 50× FLOPs reduction for training and inference, respectively.
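For readers unfamiliar with multi-step TD targets, the following is a minimal sketch of the textbook n-step return, not the RLx2 mechanism itself, whose details the abstract does not give. The function name `n_step_td_target` and its parameters are hypothetical, and the bootstrap value is assumed to come from a target network evaluated at the state reached after the last reward.

```python
from typing import Sequence


def n_step_td_target(
    rewards: Sequence[float],
    dones: Sequence[bool],
    bootstrap_value: float,
    gamma: float = 0.99,
) -> float:
    """Generic n-step TD target (illustrative, not the authors' code).

    Computes sum_{k=0}^{n-1} gamma^k * r_k + gamma^n * bootstrap_value,
    where n = len(rewards). If an episode ends inside the window, the sum
    is truncated there and no bootstrap term is added.
    """
    target = 0.0
    discount = 1.0
    for reward, done in zip(rewards, dones):
        target += discount * reward
        discount *= gamma
        if done:
            # Terminal state reached: nothing to bootstrap from.
            return target
    # No termination within the window: bootstrap from the value estimate.
    return target + discount * bootstrap_value


# Three-step window, gamma = 0.5, bootstrap value 10.0:
# 1 + 0.5 + 0.25 + 0.125 * 10 = 3.0
example = n_step_td_target([1.0, 1.0, 1.0], [False, False, False], 10.0, gamma=0.5)
```

Longer windows rely more on observed rewards and less on the bootstrapped estimate, which is why multi-step targets are a common way to reduce the effect of value-estimation error. The window must be chosen with care in off-policy settings, since older transitions may no longer reflect the current policy.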