Hindsight Experience Replay (HER) is a reinforcement learning (RL) technique that has proven very efficient for training off-policy RL agents to solve goal-based robotic manipulation tasks with sparse rewards. Although HER improves the sample efficiency of RL agents by learning from mistakes made in past experiences, it provides no guidance while the agent explores the environment. Training times are therefore very long, because of the volume of experience needed to train an agent with this replay strategy. In this paper, we propose a method that uses primitive behaviours, previously learned to solve simple tasks, to guide the agent toward more rewarding actions during exploration while it learns other, more complex tasks. This guidance is not driven by a manually designed curriculum; instead, a critic network decides at each timestep whether to use the actions proposed by the previously learned primitive policies. We evaluate our method by comparing its performance against HER and other, more efficient variations of this algorithm on several block manipulation tasks. We demonstrate that agents learn a successful policy faster with our proposed method, in terms of both sample efficiency and computation time. Code is available at https://github.com/franroldans/qmp-her.
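As a rough illustration of the critic-based guidance described above, the following sketch selects, at a single timestep, between the action of the agent's own policy and the actions proposed by primitive policies. The abstract does not state the exact decision rule, so the rule used here (take the candidate with the highest critic value) is an assumption, and `select_action`, `toy_critic`, `agent_policy`, and `reach_primitive` are hypothetical stand-ins rather than names from the released code.

```python
import numpy as np


def select_action(critic, agent_policy, primitive_policies, obs, goal):
    """Pick the candidate action that the critic scores highest.

    Assumed rule, not confirmed by the abstract: candidate 0 is the agent's
    own action, candidates 1..N come from the primitive policies, and the
    one with the largest Q-value is executed.
    """
    candidates = [agent_policy(obs, goal)]
    candidates += [primitive(obs, goal) for primitive in primitive_policies]
    q_values = np.array([critic(obs, goal, action) for action in candidates])
    best = int(np.argmax(q_values))
    return candidates[best], best


# Toy stand-ins (hypothetical) so the sketch runs without a trained network.
def toy_critic(obs, goal, action):
    # Scores an action by how close it leaves the state to the goal.
    return -float(np.linalg.norm(obs + action - goal))


def agent_policy(obs, goal):
    # An untrained agent that proposes no movement.
    return np.zeros_like(obs)


def reach_primitive(obs, goal):
    # A "reach" primitive: step toward the goal, capped at unit length.
    step = goal - obs
    norm = np.linalg.norm(step)
    return step if norm <= 1.0 else step / norm


obs = np.array([0.0, 0.0])
goal = np.array([3.0, 4.0])
action, source = select_action(toy_critic, agent_policy, [reach_primitive], obs, goal)
```

In this toy setting the critic prefers the primitive's action, so `source` is 1; as the agent's own policy improves, the same rule would increasingly select index 0.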