Vision-based policies for robot manipulation have achieved significant success in recent years, but remain brittle to distribution shifts such as camera viewpoint variations. Robot demonstration data is scarce and often lacks appropriate variation in camera viewpoints. Simulation offers a way to collect robot demonstrations at scale with comprehensive coverage of different viewpoints, but presents a visual sim2real challenge. To bridge this gap, we propose MANGO -- an unpaired image translation method with a novel segmentation-conditioned InfoNCE loss, a highly-regularized discriminator design, and a modified PatchNCE loss. We find that these elements are crucial for maintaining viewpoint consistency during sim2real translation. Training MANGO requires only a small amount of fixed-camera data from the real world, yet our method can generate diverse unseen viewpoints by translating simulated observations. In this domain, MANGO outperforms all other image translation methods we tested. Imitation-learning policies trained on data augmented by MANGO achieve success rates as high as 60\% on viewpoints where the non-augmented policy fails completely.
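The segmentation-conditioned InfoNCE loss is specific to MANGO and is detailed later in the paper; as background, the sketch below shows a generic InfoNCE contrastive loss in PyTorch. The function name, tensor shapes, and temperature value are illustrative assumptions, not MANGO's exact formulation.

```python
import torch
import torch.nn.functional as F

def info_nce(query, pos, negs, temperature=0.07):
    """Generic InfoNCE loss (illustrative sketch, not MANGO's
    segmentation-conditioned variant): pull each query embedding
    toward its positive and away from K negatives."""
    # Normalize so dot products become cosine similarities.
    query = F.normalize(query, dim=-1)   # (B, D)
    pos = F.normalize(pos, dim=-1)       # (B, D)
    negs = F.normalize(negs, dim=-1)     # (B, K, D)

    l_pos = (query * pos).sum(-1, keepdim=True)        # (B, 1)
    l_neg = torch.einsum('bd,bkd->bk', query, negs)    # (B, K)
    logits = torch.cat([l_pos, l_neg], dim=1) / temperature

    # The positive sits at column 0 of each row, so InfoNCE reduces
    # to cross-entropy against the all-zeros label vector.
    labels = torch.zeros(query.size(0), dtype=torch.long)
    return F.cross_entropy(logits, labels)
```

In MANGO's setting, the queries and positives would come from corresponding patches of simulated and translated images, with the segmentation conditioning restricting which patches count as positives.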