Deep Reinforcement Learning has been successfully applied to learn robotic control. However, the corresponding algorithms struggle when applied to problems where the agent is only rewarded after achieving a complex task. In this context, using demonstrations can significantly speed up the learning process, but demonstrations can be costly to acquire. In this paper, we propose to leverage a sequential bias to learn control policies for complex robotic tasks using a single demonstration. To do so, our method learns a goal-conditioned policy to control a system between successive low-dimensional goals. This sequential goal-reaching approach raises a problem of compatibility between successive goals: we need to ensure that the state resulting from reaching a goal is compatible with the achievement of the following goals. To tackle this problem, we present a new algorithm called DCIL-II. We show that DCIL-II can solve with unprecedented sample efficiency some challenging simulated tasks such as humanoid locomotion and stand-up as well as fast running with a simulated Cassie robot. Our method leveraging sequentiality is a step towards the resolution of complex robotic tasks under minimal specification effort, a key feature for the next generation of autonomous robots.