This work proposes a self-supervised learning system for segmenting rigid objects in RGB images. The pipeline is trained on unlabeled RGB-D videos of static objects, which can be captured with a camera carried by a mobile robot. A key component of the self-supervised training process is a graph-matching algorithm that operates on an over-segmentation of the point cloud reconstructed from each video. Together with point cloud registration, graph matching finds recurring object patterns across videos and combines them into 3D object pseudo labels, even under occlusion or changes in viewing angle. The 3D pseudo labels are projected into 2D object masks, which are used to train a pixel-wise feature extractor through contrastive learning. During online inference, a clustering method groups foreground pixels into object segments using the learned features. Experiments on both real and synthetic video datasets, which include cluttered scenes of tabletop objects, demonstrate the method's effectiveness. The proposed method outperforms existing unsupervised methods for object segmentation by a large margin.
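The contrastive training step described above can be illustrated with a small sketch. The abstract does not state the exact loss, so the code below assumes a supervised-contrastive-style (InfoNCE-like) objective over sampled pixels: pixels that fall in the same projected pseudo-label mask are treated as positives, and all other sampled pixels as negatives. The function names, the `temperature` parameter, and the toy feature vectors are hypothetical and are not taken from the paper.

```python
import math


def normalize(v):
    """Scale a feature vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def pixel_contrastive_loss(features, labels, temperature=0.1):
    """Supervised-contrastive-style loss over sampled pixels (assumed form).

    features: per-pixel embedding vectors (hypothetical extractor output)
    labels:   pseudo-label object id per pixel (from projected 3D masks)
    """
    feats = [normalize(f) for f in features]
    n = len(feats)
    total, anchors = 0.0, 0
    for i in range(n):
        positives = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if not positives:
            continue  # an anchor needs at least one same-object pixel
        # Temperature-scaled cosine similarity to every other pixel.
        logits = {
            j: sum(a * b for a, b in zip(feats[i], feats[j])) / temperature
            for j in range(n)
            if j != i
        }
        # Log-sum-exp with max subtraction for numerical stability.
        m = max(logits.values())
        log_denom = m + math.log(sum(math.exp(v - m) for v in logits.values()))
        # Average negative log-probability of picking a positive.
        total += -sum(logits[j] - log_denom for j in positives) / len(positives)
        anchors += 1
    return total / anchors if anchors else 0.0


# Toy example: four pixels, two pseudo-labelled objects (ids 0 and 1).
labels = [0, 0, 1, 1]
separated = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
mixed = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
loss_separated = pixel_contrastive_loss(separated, labels)
loss_mixed = pixel_contrastive_loss(mixed, labels)
```

In this toy case the loss is lower when features of the same pseudo-labelled object point in similar directions, which is the property the clustering step at inference time relies on. A real training loop would compute the same quantity on batched tensors in a deep learning framework and would subsample pixels per mask.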