Offline-to-online reinforcement learning (O2O RL) aims to obtain a continually improving policy as the agent interacts with the environment, while ensuring that its initial behaviour is satisficing. Such satisficing behaviour is necessary in robotic manipulation, where random exploration can be costly in terms of both time and catastrophic failures. O2O RL is especially compelling when only a scarce amount of (potentially suboptimal) demonstrations is available, a scenario in which behavioural cloning (BC) is known to suffer from distribution shift. Previous works have outlined the challenges of applying O2O RL algorithms in image-based environments. In this work, we propose a novel O2O RL algorithm that can learn, from a small number of demonstrations, a real-world image-based robotic vacuum-grasping task on which BC fails the majority of the time. The proposed algorithm replaces the target network in off-policy actor-critic algorithms with a regularization technique inspired by the neural tangent kernel. We demonstrate that the proposed algorithm exceeds a 90% success rate in under two hours of interaction time with only 50 human demonstrations, while BC and two commonly used RL algorithms fail to achieve comparable performance.
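To make the core idea concrete, the sketch below illustrates one way a target network can be replaced by a functional regularizer in temporal-difference learning. This is an illustrative assumption, not the paper's implementation: it uses linear Q-features, and the NTK-inspired ingredient is approximated by a proximal penalty that discourages the update from drifting the next-state Q-values away from their pre-update snapshot (the role a frozen target copy would otherwise play). All names, hyperparameters, and the inner-loop structure are hypothetical.

```python
import numpy as np

# Minimal sketch (assumed, not the paper's method): a TD update without a
# target network. Stability comes from a proximal penalty that keeps the
# next-state Q-values close to a snapshot taken before the update, standing
# in for the NTK-inspired regularizer described in the abstract.

rng = np.random.default_rng(0)
n_feat, n_act = 8, 3
gamma, lr, lam = 0.99, 0.05, 1.0  # lam: regularization strength (assumed)

def td_step(w, phi, a, r, phi_next, inner_steps=10):
    """Gradient steps on  (Q(s,a) - target)^2 + lam * ||Q(s') - Q_old(s')||^2,
    where both the bootstrapped target and Q_old(s') are snapshots of the
    current (not a separate target) network, taken before the update."""
    q_next_old = phi_next @ w              # snapshot of next-state Q-values
    target = r + gamma * q_next_old.max()  # treated as a constant below
    w_new = w.copy()
    for _ in range(inner_steps):
        td_err = (phi @ w_new)[a] - target
        g = np.zeros_like(w_new)
        g[:, a] += td_err * phi                   # TD-error term
        diff = phi_next @ w_new - q_next_old      # functional drift at s'
        g += lam * np.outer(phi_next, diff)       # proximal penalty term
        w_new -= lr * g
    return w_new, target, q_next_old

# Hypothetical transition (s, a, r, s') with random linear features.
w = rng.normal(size=(n_feat, n_act)) * 0.1
phi, phi_next = rng.normal(size=n_feat), rng.normal(size=n_feat)
a, r = 0, 1.0

w2, target, q_next_old = td_step(w, phi, a, r, phi_next)
before = abs((phi @ w)[a] - target)
after = abs((phi @ w2)[a] - target)
drift = float(np.linalg.norm(phi_next @ w2 - q_next_old))
print(f"|TD error|: {before:.3f} -> {after:.3f}, next-state drift: {drift:.3f}")
```

The design choice mirrored here is that the penalty is zero at the snapshot and grows as the update moves the next-state predictions, so it damps exactly the feedback loop a target network is meant to break, without maintaining a second set of weights.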