Natural Language Can Help Bridge the Sim2Real Gap

The main challenge in learning image-conditioned robotic policies is acquiring a visual representation conducive to low-level control. Because the image space is high-dimensional, learning a good visual representation requires a considerable amount of visual data, and data collected in the real world is expensive. Sim2Real is a promising paradigm for overcoming data scarcity in the real-world target domain: a simulator is used to collect large amounts of cheap data closely related to the target task. However, it is difficult to transfer an image-conditioned policy from sim to real when the two domains are visually dissimilar. To bridge the sim2real visual gap, we propose using natural language descriptions of images as a unifying signal across domains that captures the underlying task-relevant semantics. Our key insight is that if two image observations from different domains are labeled with similar language, the policy should predict similar action distributions for both images. We demonstrate that training the image encoder to predict either the language description of a sim or real image, or the distance between the descriptions of two such images, serves as a useful, data-efficient pretraining step that helps learn a domain-invariant image representation. We can then use this image encoder as the backbone of an imitation learning (IL) policy trained simultaneously on a large number of simulated demonstrations and a handful of real demonstrations. Our approach outperforms widely used prior sim2real methods and strong vision-language pretraining baselines like CLIP and R3M by 25 to 40%. See additional videos and materials at https://robin-lab.cs.utexas.edu/lang4sim2real/.
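
To make the language-distance pretraining idea concrete, here is a minimal sketch and not the paper's actual objective. It assumes each image already has an embedding from the image encoder and each description has an embedding from some text encoder; the function names, the choice of cosine distance, the squared-error penalty, and the toy vectors are all illustrative assumptions. The loss is small when images with similar descriptions sit close together in embedding space, regardless of whether they come from sim or real.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def language_distance_loss(image_embs, lang_embs):
    """Mean squared gap between pairwise image and language distances.

    Hypothetical objective: for every pair of images, the distance between
    their image embeddings should match the distance between the embeddings
    of their language descriptions.
    """
    n = len(image_embs)
    if n < 2 or n != len(lang_embs):
        raise ValueError("need at least two images, one description each")
    total, pairs = 0.0, 0
    for i in range(n):
        for j in range(i + 1, n):
            d_img = cosine_distance(image_embs[i], image_embs[j])
            d_lang = cosine_distance(lang_embs[i], lang_embs[j])
            total += (d_img - d_lang) ** 2
            pairs += 1
    return total / pairs


# Toy data (made up): images 0 (sim) and 1 (real) share one description,
# image 2 (sim) has a different description.
lang_embs = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]

# Encoder that groups images by meaning: sim and real images of the same
# stage point in the same direction.
aligned_embs = [[2.0, 0.0], [3.0, 0.0], [0.0, 1.0]]

# Encoder that groups images by domain instead: the two sim images match
# each other, and the real image is off on its own.
domain_split_embs = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]

aligned_loss = language_distance_loss(aligned_embs, lang_embs)
domain_split_loss = language_distance_loss(domain_split_embs, lang_embs)
```

In this toy setup the encoder that groups images by meaning incurs a lower loss than the one that groups them by domain, which is the behavior a domain-invariant representation should exhibit. In practice the embeddings would come from trainable networks and the loss would be minimized by gradient descent; those details are outside this sketch.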
