Despite the considerable potential of reinforcement learning (RL), robotic control tasks predominantly rely on imitation learning (IL) owing to its better sample efficiency. However, given the high cost of collecting extensive demonstrations, RL remains appealing if it can use limited imitation data for efficient autonomous self-improvement. Existing RL methods that use demonstrations take one of two approaches. Some initialize the replay buffer with demonstrations and oversample them during RL training, which does not benefit from the generalization potential of modern IL methods. Others pretrain the RL policy with IL on the demonstrations, which requires additional mechanisms to prevent catastrophic forgetting during RL fine-tuning. We propose imitation bootstrapped reinforcement learning (IBRL), a novel framework that first trains an IL policy on a limited number of demonstrations and then uses it to propose alternative actions for both online exploration and target value bootstrapping. IBRL achieves state-of-the-art performance and sample efficiency on 7 challenging sparse-reward continuous control tasks in simulation while learning directly from pixels. As a highlight of our method, IBRL achieves a $6.4\times$ higher success rate than RLPD, a strong method that combines the idea of oversampling demonstrations with modern RL improvements, under a budget of 10 demos and 100K interactions on the challenging PickPlaceCan task from the Robomimic benchmark.
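The two uses of the IL policy named above, proposing actions for online exploration and for target value bootstrapping, can be illustrated with a minimal sketch. It assumes that the critic arbitrates between the RL and IL proposals by taking the one with the higher Q-value; the abstract does not state the selection rule, so this greedy choice is an assumption. The names `select_action`, `bootstrap_target`, `q_fn`, `q_target_fn`, and the discount `gamma` are hypothetical and introduced only for illustration.

```python
import numpy as np


def select_action(q_fn, state, rl_action, il_action):
    """Exploration step: act with whichever proposal the critic scores higher.

    Assumption: selection is greedy with respect to the Q-values of the
    two candidate actions (one from the RL policy, one from the IL policy).
    """
    candidates = [rl_action, il_action]
    q_values = [q_fn(state, a) for a in candidates]
    return candidates[int(np.argmax(q_values))]


def bootstrap_target(q_target_fn, reward, next_state,
                     rl_next_action, il_next_action, done, gamma=0.99):
    """TD target that bootstraps from the better of the two proposals.

    Assumption: the target uses the maximum target-critic value over the
    RL and IL actions proposed for the next state.
    """
    next_q = max(q_target_fn(next_state, rl_next_action),
                 q_target_fn(next_state, il_next_action))
    # No bootstrapping past a terminal transition.
    return reward + gamma * (1.0 - float(done)) * next_q


def toy_q(state, action):
    """Toy critic for illustration: prefers actions close to the state."""
    diff = np.asarray(action, dtype=float) - np.asarray(state, dtype=float)
    return -float(np.sum(diff ** 2))
```

With this toy critic, a state of `[0.5, -0.2]`, an RL proposal of `[0.0, 0.0]`, and an IL proposal of `[0.4, -0.1]`, `select_action` returns the IL proposal because it scores higher. As the learned critic comes to prefer the RL policy's actions, the same rule hands control over to the RL policy without any explicit schedule.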