The existing internet-scale image and video datasets cover a wide range of everyday objects and tasks, bringing the potential of learning policies that generalize in diverse scenarios. Prior works have explored visual pre-training with different self-supervised objectives. Still, the generalization capabilities of the learned policies and the advantages over well-tuned baselines remain unclear from prior studies. In this work, we present a focused study of the generalization capabilities of the pre-trained visual representations at the categorical level. We identify the key bottleneck in using a frozen pre-trained visual backbone for policy learning and then propose SpawnNet, a novel two-stream architecture that learns to fuse pre-trained multi-layer representations into a separate network to learn a robust policy. Through extensive simulated and real experiments, we show significantly better categorical generalization compared to prior approaches in imitation learning settings. Open-sourced code and videos can be found on our website: https://xingyu-lin.github.io/spawnnet.
翻译:现有的互联网级图像和视频数据集涵盖了广泛的日常物体和任务,使得学习能够在多样化场景中泛化的策略成为可能。先前的工作已探索了采用不同自监督目标进行视觉预训练的方法。然而,从先前研究中仍不明确,所学习策略的泛化能力以及相较于精心调优基线的优势。在本工作中,我们对预训练视觉表示在类别层面的泛化能力进行了聚焦性研究。我们识别了使用冻结的预训练视觉主干进行策略学习的关键瓶颈,进而提出SpawnNet——一种新颖的双流架构,该架构学习将预训练的多层表示融合到一个独立网络中,以学习鲁棒策略。通过广泛的仿真和实际实验,我们展示了在模仿学习场景下,相较于先前方法,在类别泛化方面取得了显著更优的性能。开源代码和视频可在我们的网站上获取:https://xingyu-lin.github.io/spawnnet。