The existing internet-scale image and video datasets cover a wide range of everyday objects and tasks, bringing the potential of learning policies that have broad generalization. Prior works have explored visual pre-training with different self-supervised objectives, but the generalization capabilities of the learned policies remain relatively unknown. In this work, we take the first step towards this challenge, focusing on how pre-trained representations can help the generalization of the learned policies. We first identify the key bottleneck in using a frozen pre-trained visual backbone for policy learning. We then propose SpawnNet, a novel two-stream architecture that learns to fuse pre-trained multi-layer representations into a separate network to learn a robust policy. Through extensive simulated and real experiments, we demonstrate significantly better categorical generalization compared to prior approaches in imitation learning settings.
翻译:现有的互联网规模图像和视频数据集涵盖了广泛的日常物体和任务,为学习具有广泛泛化能力的策略提供了可能性。以往的研究探索了使用不同自监督目标进行视觉预训练,但所学策略的泛化能力仍相对未知。在本工作中,我们迈出了应对这一挑战的第一步,聚焦于预训练表示如何帮助所学策略的泛化。我们首先确定了使用冻结的预训练视觉骨干进行策略学习的关键瓶颈。随后提出SpawnNet,一种新颖的双流架构,通过学习将预训练的多层表示融合到独立的网络中,从而学习鲁棒策略。通过大量仿真和真实实验,我们证明在模仿学习设置中,该方法相比先前方法显著提升了类别层面的泛化性能。