In the search for more sample-efficient reinforcement-learning (RL) algorithms, a promising direction is to leverage as much external off-policy data as possible, such as expert demonstrations. Several ideas have been proposed to make good use of demonstrations added to the replay buffer, including pretraining on demonstrations only and minimizing additional cost functions. We present a new method that leverages both demonstrations and episodes collected online, in any sparse-reward environment and with any off-policy algorithm. Our method relabels rewards: it gives a reward bonus to the transitions of demonstrations and of successful episodes, which encourages both expert imitation and self-imitation. Our experiments focus on several robotic-manipulation tasks across two different simulation environments. We show that reward relabeling improves the performance of the base algorithms (SAC and DDPG) on these tasks. Finally, our best algorithm, STIR$^2$ (Self and Teacher Imitation by Reward Relabeling), which integrates multiple improvements from previous works into our method, is more data-efficient than all baselines.
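To make the relabeling idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes that an episode counts as successful if any of its transitions carries a positive sparse reward, and that a constant bonus is added to every transition of a demonstration or successful episode. The names `Transition`, `relabel_episode`, and the `bonus` parameter are hypothetical, introduced only for illustration; the abstract does not specify the bonus value or which transitions receive it.

```python
from dataclasses import dataclass, replace
from typing import List, Sequence


@dataclass(frozen=True)
class Transition:
    """One step of experience, as stored in an off-policy replay buffer."""
    obs: Sequence[float]
    action: Sequence[float]
    reward: float
    next_obs: Sequence[float]
    done: bool


def relabel_episode(
    episode: List[Transition],
    is_demo: bool,
    bonus: float = 1.0,  # assumption: a constant bonus; the value is illustrative
) -> List[Transition]:
    """Return a copy of the episode with a reward bonus where applicable.

    Demonstrations always receive the bonus (expert imitation). Episodes
    collected online receive it only if they succeeded (self-imitation).
    """
    # Assumption: in a sparse-reward task, a positive reward marks success.
    succeeded = any(t.reward > 0.0 for t in episode)
    if not (is_demo or succeeded):
        return list(episode)
    # Assumption: the bonus is applied to every transition of the episode.
    return [replace(t, reward=t.reward + bonus) for t in episode]
```

Under these assumptions, the relabeled transitions would simply be appended to the replay buffer of the base off-policy algorithm (for example SAC or DDPG), which is then trained unchanged.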