Vision-based imitation learning has shown promising capabilities in endowing robots with various motion skills given visual observations. However, current visuomotor policies fail to adapt to drastic changes in their visual observations. We present Perception Stitching, which enables strong zero-shot adaptation to large visual changes by directly stitching novel combinations of visual encoders. Our key idea is to enforce the modularity of visual encoders by aligning the latent visual features across different visuomotor policies. Our method disentangles perceptual knowledge from the downstream motion skills and allows a visual encoder to be reused by directly stitching it to a policy network trained under partially different visual conditions. We evaluate our method on various simulated and real-world manipulation tasks. While baseline methods fail in all attempts, our method achieves zero-shot success in real-world visuomotor tasks. Our quantitative and qualitative analyses of the learned features of the policy network provide further insight into the high performance of our proposed method.
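To make the core mechanism concrete (modular visual encoders made interchangeable by latent-feature alignment, then recombined zero-shot with a policy head), the following is a minimal PyTorch-style sketch. All names here (`VisuomotorPolicy`, `alignment_loss`, `stitch`, `training_step`), the cosine-similarity objective, and the paired-observation assumption are illustrative choices for exposition, not the paper's actual implementation.

```python
# Minimal sketch of encoder stitching with latent-feature alignment.
# Assumes PyTorch; all class and function names are hypothetical.
import torch
import torch.nn as nn
import torch.nn.functional as F


class VisuomotorPolicy(nn.Module):
    """A visual encoder followed by a policy head mapping features to actions."""

    def __init__(self, encoder: nn.Module, policy_head: nn.Module):
        super().__init__()
        self.encoder = encoder          # image observation -> latent feature
        self.policy_head = policy_head  # latent feature -> action

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        return self.policy_head(self.encoder(obs))


def alignment_loss(z_a: torch.Tensor, z_b: torch.Tensor) -> torch.Tensor:
    """Pull two policies' latent visual features into a shared space so
    their encoders become interchangeable (modular)."""
    return 1.0 - F.cosine_similarity(z_a, z_b, dim=-1).mean()


def training_step(pol_a, pol_b, obs_a, obs_b, actions, lam=1.0):
    # obs_a / obs_b: paired observations of the same underlying state under
    # two different visual conditions (a simplifying assumption here).
    z_a, z_b = pol_a.encoder(obs_a), pol_b.encoder(obs_b)
    bc = F.mse_loss(pol_a.policy_head(z_a), actions) \
       + F.mse_loss(pol_b.policy_head(z_b), actions)
    return bc + lam * alignment_loss(z_a, z_b)


def stitch(donor: VisuomotorPolicy, recipient: VisuomotorPolicy) -> VisuomotorPolicy:
    """Zero-shot stitching: reuse the donor's visual encoder with the
    recipient's policy head, with no further fine-tuning."""
    return VisuomotorPolicy(donor.encoder, recipient.policy_head)
```

Under this reading, the alignment term during training is what licenses the zero-shot recombination at test time: because the two encoders map paired observations to nearby latents, either encoder can feed either policy head when the visual conditions change.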