Deep reinforcement learning with domain randomization learns a control policy across many simulations with randomized physical and sensor model parameters so that the policy transfers to the real world in a zero-shot setting. However, when the range of randomized parameters is wide, policy updates become unstable, and a huge number of samples is often required to learn an effective policy. To alleviate this problem, we propose a sample-efficient method named cyclic policy distillation (CPD). CPD divides the range of randomized parameters into several small sub-domains and assigns a local policy to each one. The local policies are then learned while cyclically transitioning between sub-domains. CPD accelerates learning through knowledge transfer based on expected performance improvements. Finally, all of the learned local policies are distilled into a global policy for sim-to-real transfer. CPD's effectiveness and sample efficiency are demonstrated through simulations with four tasks (Pendulum from OpenAI Gym and Pusher, Swimmer, and HalfCheetah from MuJoCo) and a real-robot ball-dispersal task. We published code and videos from our experiments at https://github.com/yuki-kadokawa/cyclic-policy-distillation.
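The sub-domain split and cyclic visiting order described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the equal-width split, the round-robin visiting order, and the example range [0.5, 1.5] are all assumptions made for illustration, and the policy learning, knowledge transfer, and distillation steps are omitted.

```python
import random
from itertools import cycle, islice


def split_range(low, high, n_subdomains):
    """Split [low, high] into equal-width, contiguous sub-domains (assumed split rule)."""
    width = (high - low) / n_subdomains
    return [(low + i * width, low + (i + 1) * width) for i in range(n_subdomains)]


def cyclic_schedule(n_subdomains, n_steps):
    """Return sub-domain indices visited round-robin: 0, 1, ..., n-1, 0, 1, ...

    The round-robin order is an assumption; the abstract only says the
    transitions are cyclic.
    """
    return list(islice(cycle(range(n_subdomains)), n_steps))


def sample_parameter(subdomain, rng):
    """Draw one randomized parameter uniformly from a single sub-domain."""
    low, high = subdomain
    return rng.uniform(low, high)


# Hypothetical example: one randomized parameter with range [0.5, 1.5].
subdomains = split_range(0.5, 1.5, 4)
schedule = cyclic_schedule(len(subdomains), 6)
rng = random.Random(0)  # fixed seed so the sketch is reproducible
samples = [sample_parameter(subdomains[i], rng) for i in schedule]
```

In a full pipeline, each visit in `schedule` would correspond to updating the local policy assigned to that sub-domain using simulations whose parameters are drawn as in `sample_parameter`.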