Reinforcement learning often suffers from sparse rewards in real-world robotics problems. Learning from demonstration (LfD), which leverages collected expert data to aid online learning, is an effective way to address this problem. Prior works often assume that the learning agent and the expert aim to accomplish the same task, which requires collecting new data for every new task. In this paper, we consider the case where the target task is mismatched from, but similar to, that of the expert. Such a setting can be challenging, and we found that existing LfD methods cannot effectively guide learning in mismatched new tasks with sparse rewards. We propose conservative reward shaping from demonstration (CRSfD), which shapes the sparse rewards using an estimated expert value function. To accelerate learning, CRSfD guides the agent to explore conservatively around the demonstrations. Experimental results on robot manipulation tasks show that our approach outperforms baseline LfD methods when transferring demonstrations collected in a single task to other different but similar tasks.
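To make the idea of shaping sparse rewards with an estimated expert value function concrete, the following is a minimal sketch and not the exact CRSfD formulation. It assumes the standard potential-based shaping form r' = r + γΦ(s') − Φ(s), with the potential Φ taken from value estimates at demonstration states. The nearest-neighbor lookup, the `radius` threshold, and the `floor` value used away from demonstrations are hypothetical choices made for illustration only, as are all function and parameter names.

```python
import numpy as np


def conservative_potential(state, demo_states, demo_values, radius=0.5, floor=0.0):
    """Hypothetical potential: trust the value estimate only near demonstrations.

    Returns the value of the nearest demonstration state if it lies within
    `radius` of `state`; otherwise returns the pessimistic `floor`.
    """
    dists = np.linalg.norm(demo_states - state, axis=1)
    nearest = int(np.argmin(dists))
    if dists[nearest] <= radius:
        return float(demo_values[nearest])
    return float(floor)


def shaped_reward(reward, potential_s, potential_next, gamma=0.99, done=False):
    """Potential-based shaping: r + gamma * Phi(s') - Phi(s).

    The next-state potential is dropped on terminal transitions.
    """
    next_term = 0.0 if done else gamma * potential_next
    return reward + next_term - potential_s


# Toy demonstration data (made up for this sketch).
demo_states = np.array([[0.0, 0.0], [1.0, 0.0]])
demo_values = np.array([0.5, 1.0])

phi_s = conservative_potential(np.array([0.1, 0.0]), demo_states, demo_values)
phi_next = conservative_potential(np.array([0.9, 0.0]), demo_states, demo_values)
r_shaped = shaped_reward(0.0, phi_s, phi_next, gamma=0.9)
```

In this sketch, a transition that moves toward higher-value demonstration states receives a positive shaped reward even when the environment reward is zero, while states far from any demonstration fall back to the floor potential, which discourages relying on value estimates where no expert data exists.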