Hindsight Experience Replay (HER) is a reinforcement learning (RL) technique that has proven very efficient for training off-policy agents to solve goal-based robotic manipulation tasks with sparse rewards. Although HER improves sample efficiency by learning from the mistakes made in past experience, it provides no guidance while the agent explores the environment. As a result, training times are very long because of the volume of experience this replay strategy requires. In this paper, we propose a method that uses primitive behaviours, previously learned to solve simple tasks, to guide the agent toward more rewarding actions during exploration while it learns more complex tasks. This guidance is not imposed by a manually designed curriculum; instead, a critic network decides at each timestep whether to use the actions proposed by the previously learned primitive policies. We evaluate our method by comparing its performance against HER and other, more efficient variations of this algorithm on several block manipulation tasks. We demonstrate that agents learn a successful policy faster with our proposed method, in terms of both sample efficiency and computation time. Code is available at https://github.com/franroldans/qmp-her.
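To make the critic-based guidance concrete, the following is a minimal sketch of one plausible reading of it: at each timestep the critic scores the agent's own action and the action proposed by each primitive policy, and the highest-scoring candidate is executed. The selection rule (a plain argmax over critic values), the function `select_action`, and the toy critic and policies are all assumptions made for illustration; they are not taken from the released code.

```python
from typing import Callable, List, Sequence, Tuple

Vector = Tuple[float, ...]
Critic = Callable[[Vector, Vector, Vector], float]  # Q(state, goal, action)
Policy = Callable[[Vector, Vector], Vector]         # pi(state, goal) -> action


def select_action(
    critic: Critic,
    agent_policy: Policy,
    primitives: Sequence[Policy],
    state: Vector,
    goal: Vector,
) -> Tuple[Vector, int]:
    """Pick the candidate action with the highest critic value.

    Returns (action, source), where source 0 is the agent's own policy and
    source i >= 1 is the i-th primitive. Ties go to the lowest index, so the
    agent's own action wins a tie. This rule is an illustrative assumption.
    """
    candidates: List[Vector] = [agent_policy(state, goal)]
    candidates += [primitive(state, goal) for primitive in primitives]
    scores = [critic(state, goal, action) for action in candidates]
    best = max(range(len(candidates)), key=scores.__getitem__)
    return candidates[best], best


# Hypothetical stand-ins for a trained critic and policies.
def toy_critic(state: Vector, goal: Vector, action: Vector) -> float:
    # Scores an action by how close it would move the state to the goal.
    return -sum((s + a - g) ** 2 for s, a, g in zip(state, action, goal))


def toy_agent(state: Vector, goal: Vector) -> Vector:
    # An untrained agent that always proposes a zero action.
    return tuple(0.0 for _ in state)


def toy_reach(state: Vector, goal: Vector) -> Vector:
    # A "reach" primitive: step toward the goal, clipped to [-1, 1] per axis.
    return tuple(max(-1.0, min(1.0, g - s)) for s, g in zip(state, goal))


action, source = select_action(
    toy_critic, toy_agent, [toy_reach], (0.0, 0.0), (0.5, -0.5)
)
```

In this toy setting the primitive's action is chosen, because the critic rates it above the untrained agent's zero action. In an actual training loop the executed transition would still be stored in the replay buffer and relabelled by HER as usual.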