Pseudo depth maps are depth map predictions that are used as ground truth during training. In this paper we leverage pseudo depth maps to segment objects of classes that have never been seen during training, which makes our object segmentation task an open-world task. The pseudo depth maps are generated with pretrained networks that have either been trained explicitly to generalize to downstream tasks (LeRes and MiDaS) or been trained in an unsupervised fashion on video sequences (MonodepthV2). To tell our network which object to segment, we provide it with a single click on the object's surface, placed on the pseudo depth map of the image, as input. We test our approach in two scenarios: one without the RGB image and one in which the RGB image is part of the input. Our results demonstrate considerably better generalization from seen to unseen object types when depth is used. On the Semantic Boundaries Dataset, we improve the IoU score on unseen classes from $61.57$ to $69.79$ when using only half of the training classes during training and performing the segmentation on depth maps only.
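To make the input described above concrete, the following is a minimal sketch of how a pseudo depth map, a single click, and an optional RGB image could be stacked into one network input, together with the IoU metric. The abstract does not specify how the click is encoded, so the Gaussian heatmap, the channel order, the depth normalization, and the names `click_map`, `build_input`, and `iou` are all assumptions made for illustration, not the method of the paper.

```python
import numpy as np


def click_map(height, width, click_yx, sigma=10.0):
    """Encode one click as a Gaussian heatmap (hypothetical encoding)."""
    ys, xs = np.mgrid[0:height, 0:width]
    cy, cx = click_yx
    dist_sq = (ys - cy) ** 2 + (xs - cx) ** 2
    return np.exp(-dist_sq / (2.0 * sigma ** 2)).astype(np.float32)


def build_input(depth, click_yx, rgb=None, sigma=10.0):
    """Stack channels as (C, H, W): [RGB (optional), depth, click].

    depth: (H, W) pseudo depth map from any pretrained depth network.
    rgb:   optional (H, W, 3) uint8 image; omitted in the depth-only setting.
    """
    height, width = depth.shape
    d = depth.astype(np.float32)
    value_range = d.max() - d.min()
    # Min-max normalization is an assumption; a constant map becomes all zeros.
    d = (d - d.min()) / value_range if value_range > 0 else np.zeros_like(d)

    channels = [d, click_map(height, width, click_yx, sigma)]
    if rgb is not None:
        rgb_channels = [rgb[..., i].astype(np.float32) / 255.0 for i in range(3)]
        channels = rgb_channels + channels
    return np.stack(channels, axis=0)


def iou(pred_mask, target_mask):
    """Intersection over union of two binary masks, as a fraction in [0, 1]."""
    pred = np.asarray(pred_mask).astype(bool)
    target = np.asarray(target_mask).astype(bool)
    union = np.logical_or(pred, target).sum()
    if union == 0:
        return 1.0  # both masks empty: treated as a perfect match by convention
    return float(np.logical_and(pred, target).sum() / union)
```

In this sketch the depth-only scenario yields a two-channel input and the RGB scenario a five-channel input. The scores quoted above appear to be on a 0 to 100 scale, so the fraction returned by `iou` would be multiplied by 100 to compare with them.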