Deep reinforcement learning (DRL) algorithms require large numbers of samples and substantial computational resources to achieve high performance, which restricts their practical application and hinders further development. When resources are limited, it is essential to leverage existing computational work (e.g., learned policies, samples) to improve sample efficiency and reduce the computational cost of DRL algorithms. Previous approaches to leveraging existing computational work require intrusive modifications to existing algorithms and models and are tailored to particular algorithms, so they lack flexibility and generality. In this paper, we present the Snapshot Reinforcement Learning (SnapshotRL) framework, which improves sample efficiency simply by altering the environment, without any modification to algorithms or models. By allowing student agents to choose states from teacher trajectories as initial states for sampling, SnapshotRL uses teacher trajectories to assist student training, so that student agents can explore a larger state space in the early training phase. We propose a simple and effective SnapshotRL baseline algorithm, S3RL, which integrates well with existing DRL algorithms. Our experiments demonstrate that integrating S3RL with the TD3, SAC, and PPO algorithms on the MuJoCo benchmark significantly improves sample efficiency and average return, without extra samples or additional computational resources.
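The core idea, changing only where episodes start, can be illustrated with a minimal sketch. This is not the S3RL algorithm itself: the toy `ChainEnv`, its `set_snapshot` hook, the uniform choice among teacher states, and the fixed `snapshot_resets` budget standing in for the "early training phase" are all assumptions made for illustration.

```python
import random


class ChainEnv:
    """Toy 1-D chain (hypothetical): start at 0, reward 1.0 on reaching `goal`."""

    def __init__(self, goal=10, max_steps=20):
        self.goal = goal
        self.max_steps = max_steps
        self.pos = 0
        self.t = 0

    def reset(self):
        self.pos, self.t = 0, 0
        return self.pos

    def step(self, action):
        # action is -1 or +1; the position is clipped to [0, goal]
        self.pos = min(max(self.pos + action, 0), self.goal)
        self.t += 1
        reached = self.pos == self.goal
        done = reached or self.t >= self.max_steps
        return self.pos, (1.0 if reached else 0.0), done

    def set_snapshot(self, pos):
        # Hypothetical hook: restore a saved state and restart the step counter.
        # A real simulator would need its own way to restore its full state.
        self.pos, self.t = pos, 0
        return self.pos


def collect_teacher_snapshots(env, teacher_policy):
    """Roll out the teacher once, recording each state before acting in it."""
    snapshots = []
    obs, done = env.reset(), False
    while not done:
        snapshots.append(obs)
        obs, _, done = env.step(teacher_policy(obs))
    return snapshots


class SnapshotResetWrapper:
    """Environment wrapper: the first `snapshot_resets` episodes start from a
    teacher state (assumed rule: uniform choice); later episodes start normally."""

    def __init__(self, env, snapshots, snapshot_resets, seed=0):
        self.env = env
        self.snapshots = snapshots
        self.remaining = snapshot_resets
        self.rng = random.Random(seed)

    def reset(self):
        if self.remaining > 0 and self.snapshots:
            self.remaining -= 1
            return self.env.set_snapshot(self.rng.choice(self.snapshots))
        return self.env.reset()

    def step(self, action):
        # Stepping is passed through unchanged.
        return self.env.step(action)


# Teacher that always moves right; the student would train on `student_env`.
teacher_states = collect_teacher_snapshots(ChainEnv(), lambda obs: 1)
student_env = SnapshotResetWrapper(ChainEnv(), teacher_states, snapshot_resets=3)
```

Because the wrapper exposes the same `reset`/`step` interface as the wrapped environment, the student's learning code needs no changes in this sketch; only the initial-state distribution differs during the first few episodes.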