Learning on the Job: Self-Rewarding Offline-to-Online Finetuning for Industrial Insertion of Novel Connectors from Vision

Learning-based methods in robotics hold the promise of generalization, but what can be done if a learned policy does not generalize to a new situation? In principle, if an agent can at least evaluate its own success (i.e., with a reward classifier that generalizes well even when the policy does not), it could actively practice the task and finetune the policy in this situation. We study this problem in the setting of industrial insertion tasks, such as inserting connectors in sockets and setting screws. Existing algorithms rely on precise localization of the connector or socket and on carefully managed physical setups, such as assembly lines, to succeed at the task. But in unstructured environments such as homes, or even some industrial settings, robots cannot rely on precise localization and may be tasked with previously unseen connectors. Offline reinforcement learning on a variety of connector insertion tasks is a potential solution, but the resulting policy may still fail on a previously unseen connector. In such a scenario, we still need methods that can robustly solve the task through online practice. One of the main observations we make in this work is that, with a suitable representation learning and domain generalization approach, it can be significantly easier for the reward function to generalize to a new but structurally similar task (e.g., inserting a new type of connector) than for the policy. This means that a learned reward function can be used to finetune the robot's policy in situations where the policy fails to generalize zero-shot but the reward function generalizes successfully. We show that such an approach can be instantiated in the real world: it is pretrained on 50 different connectors and successfully finetuned to new connectors via the learned reward function. Videos can be viewed at https://sites.google.com/view/learningonthejob
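The self-rewarding finetuning loop described above can be illustrated with a toy sketch. This is not the authors' algorithm: the one-dimensional "insertion" problem, the threshold-based `reward_classifier`, the function `finetune_online`, and the simple move-toward-success update are all hypothetical stand-ins for the vision-based classifier and the reinforcement learning update used in a real system. The sketch only shows the data flow: the policy attempts the task, the reward model labels the attempt without human supervision, and the policy is updated from those labels.

```python
import random


def reward_classifier(residual: float, tol: float = 0.2) -> float:
    """Hypothetical stand-in for a learned success classifier.

    Returns 1.0 when the plug ends within `tol` of the socket, else 0.0.
    A real system would classify success from camera images instead.
    """
    return 1.0 if abs(residual) < tol else 0.0


def finetune_online(theta: float, target: float, episodes: int = 500,
                    noise_std: float = 0.5, lr: float = 0.5, seed: int = 0):
    """Toy online finetuning driven only by self-generated rewards.

    `theta` plays the role of the pretrained policy (its guess of the socket
    position) and `target` is the position for the novel connector.
    """
    rng = random.Random(seed)
    rewards = []
    for _ in range(episodes):
        # Exploratory insertion attempt around the current policy output.
        action = theta + rng.gauss(0.0, noise_std)
        # The classifier only sees the outcome of the attempt.
        residual = action - target
        reward = reward_classifier(residual)
        rewards.append(reward)
        # Assumed update rule: move the policy toward attempts that the
        # classifier judged successful (a crude reward-weighted update).
        if reward == 1.0:
            theta += lr * (action - theta)
    return theta, rewards


# The "pretrained" policy starts at 0.0, but the new connector sits at 1.0.
theta, rewards = finetune_online(theta=0.0, target=1.0)
print(f"finetuned theta = {theta:.3f}, successes = {int(sum(rewards))}")
```

In this toy setting the policy improves only because the reward model already recognizes success on the new task, which mirrors the abstract's premise that the reward function may generalize even when the policy does not.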
