Visual imitation learning methods demonstrate strong performance, yet they generalize poorly under perturbations of the visual input, such as variations in lighting and texture, which impedes their real-world application. We propose Stem-OB, which uses pretrained image diffusion models to suppress low-level visual differences while preserving high-level scene structure. This image inversion process transforms an observation into a shared representation from which other observations stem, with extraneous details removed. Unlike data-augmentation approaches, Stem-OB is robust to a wide range of unspecified appearance changes without additional training. Our method is a simple yet highly effective plug-and-play solution. Empirical results confirm the effectiveness of our approach on simulated tasks and show an exceptionally significant improvement in real-world applications, with an average increase of 22.2% in success rate over the best baseline. See https://hukz18.github.io/Stem-Ob/ for more information.
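The intuition behind suppressing low-level differences can be illustrated with the closed-form forward step of a diffusion process: as the diffusion step grows, fine appearance details (texture, lighting) are washed out before coarse scene structure is. The sketch below is only an illustrative NumPy toy of that forward jump, not the paper's actual inversion procedure (which runs inversion with a pretrained image diffusion model); the function names and the cosine schedule choice are assumptions for illustration.

```python
import numpy as np

def cosine_alpha_bar(t, T, s=0.008):
    # Cosine noise schedule: alpha_bar decreases from ~1 (t=0) toward ~0 (t=T).
    f = lambda u: np.cos((u / T + s) / (1 + s) * np.pi / 2) ** 2
    return f(t) / f(0)

def noise_to_step(x0, t, T=1000, rng=None):
    # Closed-form forward diffusion jump to step t:
    #   x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps
    # Larger t keeps less of x0's fine detail, so two observations that
    # differ only in low-level appearance become nearly indistinguishable.
    rng = np.random.default_rng(rng)
    a_bar = cosine_alpha_bar(t, T)
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(a_bar) * x0 + np.sqrt(1.0 - a_bar) * eps

# Toy usage: at t=0 the observation is unchanged; at larger t,
# low-level detail is progressively suppressed.
obs = np.random.default_rng(0).random((64, 64, 3))
assert np.allclose(noise_to_step(obs, 0, rng=1), obs)
noisy = noise_to_step(obs, 500, rng=1)
```

The depth of this jump plays the role of a knob: a shallow step removes only texture-scale differences, while a deeper step collapses observations toward a shared, structure-only representation.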