Vision-based policies for robot manipulation have recently achieved significant success, but they remain brittle to distribution shifts such as changes in camera viewpoint. Robot demonstration data is scarce and often lacks adequate variation in camera viewpoints. Simulation offers a way to collect robot demonstrations at scale with comprehensive viewpoint coverage, but it introduces a visual sim2real gap. To bridge this gap, we propose MANGO, an unpaired image-translation method with a novel segmentation-conditioned InfoNCE loss, a highly regularized discriminator design, and a modified PatchNCE loss. We find that these elements are crucial for maintaining viewpoint consistency during sim2real translation. Training MANGO requires only a small amount of fixed-camera real-world data, yet the method can generate diverse unseen viewpoints by translating simulated observations. In this setting, MANGO outperforms all other image-translation methods we tested. Imitation-learning policies trained on MANGO-augmented data achieve success rates as high as 60% on viewpoints where the non-augmented policy fails completely.
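For readers unfamiliar with the contrastive losses named above, the following is a minimal sketch of the standard InfoNCE objective that both the segmentation-conditioned InfoNCE loss and the PatchNCE loss build on: a query patch feature is pulled toward its positive key and pushed away from negative keys via a temperature-scaled softmax. The function name, shapes, and temperature value here are illustrative assumptions, not the paper's implementation; MANGO's segmentation conditioning (how positives and negatives are chosen per segment) is not specified in the abstract.

```python
import numpy as np

def info_nce(query, positive, negatives, temperature=0.07):
    """InfoNCE loss for a single query feature (illustrative sketch).

    query: (d,) feature vector, positive: (d,), negatives: (n, d).
    Returns the cross-entropy of identifying the positive key among
    all keys, with cosine similarities scaled by the temperature.
    """
    def normalize(x):
        # L2-normalize so dot products become cosine similarities.
        return x / np.linalg.norm(x, axis=-1, keepdims=True)

    q = normalize(query)
    pos = normalize(positive)
    neg = normalize(negatives)

    # Logit 0 is the positive pair; the rest are negatives.
    logits = np.concatenate([[q @ pos], neg @ q]) / temperature
    logits -= logits.max()  # numerical stability for the softmax
    probs = np.exp(logits) / np.exp(logits).sum()
    return -np.log(probs[0])
```

In PatchNCE-style training this loss is averaged over many spatial patches, with the positive taken from the corresponding patch location in the other domain and negatives drawn from other patches of the same image.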