Offline-to-online reinforcement learning (O2O RL) aims to obtain a continually improving policy as it interacts with the environment, while ensuring that the initial policy behaviour is satisficing. Such satisficing behaviour is necessary in robotic manipulation, where random exploration is costly both in time and in catastrophic failures. O2O RL is especially compelling when only a scarce amount of (potentially suboptimal) demonstrations is available$\unicode{x2014}$a scenario where behavioural cloning (BC) is known to suffer from distribution shift. Previous works have outlined the challenges of applying O2O RL algorithms in image-based environments. In this work, we propose a novel O2O RL algorithm that can learn a real-life image-based robotic vacuum-grasping task from a small number of demonstrations, a setting in which BC fails the majority of the time. The proposed algorithm replaces the target network in off-policy actor-critic algorithms with a regularization technique inspired by the neural tangent kernel. We demonstrate that the proposed algorithm reaches above a 90\% success rate in under two hours of interaction time with only 50 human demonstrations, while BC and existing commonly used RL algorithms fail to achieve comparable performance.
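As a rough illustration of the idea mentioned in the abstract, the following is a minimal sketch of a TD loss in which the usual bootstrapping through a separate target network is replaced by a functional regularization penalty. All names here (`q_pred`, `q_snapshot`, `reg_coef`) and the exact form of the penalty are illustrative assumptions, not the paper's actual algorithm.

```python
# Hedged sketch: a TD loss without a target network. The bootstrapped target
# uses the online critic's own next-state estimate, and a penalty keeps the
# current prediction close to a frozen snapshot of the critic's outputs
# (functional regularization). Coefficients and names are assumptions.

def td_loss_with_functional_reg(q_pred, q_next, reward, done,
                                q_snapshot, gamma=0.99, reg_coef=1.0):
    """Squared TD error plus a functional-regularization term.

    q_pred     : online critic's Q(s, a)
    q_next     : online critic's Q(s', a') (treated as a constant here,
                 i.e. no gradient would flow through it)
    q_snapshot : a frozen earlier prediction of Q(s, a)
    """
    target = reward + gamma * (1.0 - done) * q_next   # bootstrapped target
    td_term = (q_pred - target) ** 2                  # standard TD error
    reg_term = (q_pred - q_snapshot) ** 2             # keep outputs near snapshot
    return td_term + reg_coef * reg_term
```

In a full actor-critic implementation this scalar computation would be applied elementwise over a minibatch of transitions, with the snapshot refreshed on a slow schedule; that schedule and the regularization coefficient are design choices not specified by the abstract.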