Vision-based policies for robot manipulation have achieved significant recent success, but are still brittle to distribution shifts such as camera viewpoint variations. Robot demonstration data is scarce and often lacks appropriate variation in camera viewpoints. Simulation offers a way to collect robot demonstrations at scale with comprehensive coverage of different viewpoints, but presents a visual sim2real challenge. To bridge this gap, we propose MANGO -- an unpaired image translation method with a novel segmentation-conditioned InfoNCE loss, a highly-regularized discriminator design, and a modified PatchNCE loss. We find that these elements are crucial for maintaining viewpoint consistency during sim2real translation. When training MANGO, we only require a small amount of fixed-camera data from the real world, but show that our method can generate diverse unseen viewpoints by translating simulated observations. In this setting, MANGO outperforms all other image translation methods we tested. In certain real-world tabletop manipulation tasks, MANGO augmentation increases shifted-view success rates by over 40 percentage points compared to policies trained without augmentation.
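The abstract names a segmentation-conditioned InfoNCE loss but does not define it here. Below is a minimal NumPy sketch of one plausible form, where each translated patch is pulled toward its spatially corresponding source patch and pushed away from source patches carrying a *different* segmentation label. All function and variable names (`seg_conditioned_infonce`, `feat_src`, `feat_tgt`, `seg_labels`) are ours, and the exact conditioning rule is an assumption, not the paper's definition.

```python
import numpy as np

def seg_conditioned_infonce(feat_src, feat_tgt, seg_labels, tau=0.07):
    """Hypothetical sketch of a segmentation-conditioned InfoNCE loss.

    feat_src, feat_tgt : (n, d) L2-normalized patch features from the
        simulated source image and its translated counterpart.
    seg_labels : (n,) integer segmentation class for each patch location.
    tau : temperature for the softmax over similarities.
    """
    # Cosine similarity between every translated patch and every source patch.
    sims = feat_tgt @ feat_src.T / tau  # (n, n)

    losses = []
    for i in range(sims.shape[0]):
        # Positive: the spatially corresponding source patch.
        pos = sims[i, i]
        # Negatives restricted to patches of a different segmentation class,
        # so same-class patches are never pushed apart (the assumed
        # "segmentation-conditioned" part).
        neg_mask = seg_labels != seg_labels[i]
        if not neg_mask.any():
            continue  # no valid negatives for this patch
        logits = np.concatenate([[pos], sims[i, neg_mask]])
        # Cross-entropy with the positive at index 0: -pos + logsumexp(logits).
        m = logits.max()  # stabilized log-sum-exp
        losses.append(-pos + m + np.log(np.exp(logits - m).sum()))
    return float(np.mean(losses))
```

Because the positive logit is included in the log-sum-exp denominator, the per-patch loss is always non-negative, and it shrinks as the translated patch aligns with its source counterpart while staying dissimilar to other-class patches.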