We introduce a novel robotic system for improving unseen object instance segmentation in the real world by leveraging long-term robot interaction with objects. Previous approaches either grasp or push an object and then obtain the segmentation mask of the grasped or pushed object after one action. Instead, our system defers the decision on segmenting objects until after a sequence of robot pushing actions. By applying multi-object tracking and video object segmentation to the images collected via robot pushing, our system can generate segmentation masks of all the objects in these images in a self-supervised way. These include images in which objects are very close to each other, where existing object segmentation networks usually make segmentation errors. We demonstrate the usefulness of our system by taking segmentation networks trained on synthetic data and fine-tuning them with real-world data collected by our system. We show that, after fine-tuning, the segmentation accuracy of the networks is significantly improved both in the same domain and across different domains. In addition, we verify that the fine-tuned networks improve top-down robotic grasping of unseen objects in the real world.