Reinforcement learning (RL) offers a promising framework for learning policies through environment interaction, but it often requires an infeasible amount of interaction data to solve complex tasks from sparse rewards. One direction augments RL with offline data demonstrating the desired tasks, but prior work often requires large amounts of high-quality demonstration data that is difficult to obtain, especially in domains such as robotics. Our approach consists of a reverse curriculum followed by a forward curriculum. Unique to our approach compared to past work is the ability to efficiently leverage more than one demonstration via a per-demonstration reverse curriculum generated through state resets. The reverse curriculum produces an initial policy that performs well on a narrow initial state distribution, helping to overcome difficult exploration problems. A forward curriculum then accelerates training of this initial policy so that it performs well on the task's full initial state distribution, further improving demonstration and sample efficiency. We show how the combination of reverse and forward curricula in our method, RFCL, enables significant improvements in demonstration and sample efficiency over various state-of-the-art learning-from-demonstration baselines, even solving previously unsolvable tasks that require high precision and control.
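To make the per-demonstration reverse curriculum concrete, the following is a minimal sketch of the idea under stated assumptions, not the paper's exact mechanism: `demos` is a list of demonstration state trajectories, the environment is assumed to expose a hypothetical `env.reset_to(state)` for state resets, and the advancement rule (a success-rate threshold over a sliding window) and its thresholds are illustrative choices.

```python
import random

class ReverseCurriculum:
    """Illustrative per-demonstration reverse curriculum via state resets.

    Each demonstration keeps its own reset pointer, which starts at the last
    state of the trajectory (right next to success) and moves earlier as the
    policy becomes reliable from the current reset point.
    """

    def __init__(self, demos, advance_threshold=0.75, window=10):
        self.demos = demos
        self.threshold = advance_threshold  # illustrative value
        self.window = window                # illustrative value
        # One reset pointer per demonstration, starting near the goal.
        self.pointers = [len(d) - 1 for d in demos]
        self.outcomes = [[] for _ in demos]  # recent episode outcomes per demo

    def sample_initial_state(self):
        """Pick a demonstration and return (demo index, state to reset to)."""
        i = random.randrange(len(self.demos))
        return i, self.demos[i][self.pointers[i]]

    def report(self, i, success):
        """Record an episode outcome for demo i; once the recent success rate
        from the current reset point exceeds the threshold, move that demo's
        pointer one step earlier, eventually reaching its initial state."""
        self.outcomes[i].append(bool(success))
        recent = self.outcomes[i][-self.window:]
        if len(recent) == self.window and sum(recent) / self.window >= self.threshold:
            self.pointers[i] = max(0, self.pointers[i] - 1)
            self.outcomes[i].clear()
```

A training loop would call `i, s0 = curriculum.sample_initial_state()`, reset the environment to `s0`, run the episode, and feed the sparse-reward outcome back via `curriculum.report(i, success)`. Because each demonstration maintains its own pointer, easy demonstrations advance toward their true initial states quickly while harder ones linger near the goal, which is what lets the method exploit multiple demonstrations of varying difficulty.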